ABSTRACT


As the information age comes to fruition, terrorist networks have moved
mainstream by promoting their causes via the World Wide Web. In addition to their
standard rhetoric, these organizations provide anyone with an Internet connection the
ability to access dangerous information involving the creation and implementation of
Improvised Explosive Devices (IEDs). Unfortunately for governments combating
terrorism, IED education networks can be very difficult to find and even harder to
monitor. Regular commercial search engines are not up to this task, as they have been
optimized to catalog information quickly and efficiently for user ease of access while
promoting retail commerce at the same time. This thesis presents a performance analysis
of a new search engine algorithm designed to help find IED education networks using the
Nutch open-source search engine architecture. It reveals which web pages are more
important via references from other web pages regardless of domain. In addition, this
thesis discusses potential evaluation and monitoring techniques to be used in conjunction
with the proposed algorithm.
EXECUTIVE SUMMARY


As the Global War on Terrorism has progressed, the use of Improvised Explosive
Devices (IEDs) against coalition forces, governments and civilian populations fighting
terrorism has drastically increased. One reason for this is easy access to the World Wide
Web [1]. The World Wide Web provides anyone with both a computer and Internet
connection access to a plethora of information within the touch of a button: anything
from encyclopedias to current news, pictures to movies, basic chemistry to the
construction of IEDs. In conjunction with this dangerous information being easily
accessible, the users and publishers have the potential to remain anonymous.
Complicating things further, terrorist organizations are exploiting this resource by
creating IED education networks via the World Wide Web to quickly and efficiently
propagate the information to their supporters and operatives.


One possible solution to this problem is an IED-specific WebCrawler. An IED
WebCrawler has the potential to quickly locate terrorist IED education networks via the
World Wide Web. Once found, these networks can be either shut down, monitored, or
infiltrated depending on the objectives of the government or agency employing the search
engine. By locating these networks, responsibility for particular attacks can be properly
assigned to specific terrorist networks, with particular IED countermeasures deployed to
prevent further loss of life and damage to property.


To accomplish this, the Nutch project was selected as the optimum search engine
to use. Its versatile plug-in architecture allows for the flexibility needed to design an
IED-specific WebCrawler while keeping implementation costs low. To improve performance,
the original algorithm was modified to dramatically enhance the web-link scores of
documents already discovered during a search. Multiple simulations were used to test the
new algorithm variations with moderate success.


Overall, the Nutch search engine is well suited for the above task, as well as
monitoring the newly discovered networks. Under its current design, Nutch is capable of
maintaining a previously found web-link database while updating it with new documents
and scores. Inflation issues concerning web-link scores arise depending on the number
and frequency of re-crawls conducted, but these are minor unless the goal is to discover
new networks after an initial crawl. This thesis does not address foreign language issues,
robot exclusion protocols or other security measures used to prevent search engines from
accessing a web page.
I. INTRODUCTION


A. PROBLEM OVERVIEW 


After the terrorist attacks of September 11, 2001, the United States of America
was forced to deal with a threat the likes of which had never been seen before. A small
network of individuals was able to effectively kill thousands of people with multiple
airborne Improvised Explosive Devices (IEDs). Following the attacks, the U.S. launched
the Global War on Terrorism, a massive anti-terrorism campaign with the goals of
bringing to justice the people responsible for the 9/11 attacks, as well as the terrorist
organization that planned them, al-Qaeda. The end state objective of the campaign is to
continue to prevent the emergence and sustainment of other terrorist organizations, while
permanently degrading the abilities of these organizations to engage in terrorism
effectively.


As the Global War on Terrorism has progressed, the use of IEDs against coalition
forces, governments and civilian populations fighting terrorism has drastically increased.
One reason for this is easy access to the World Wide Web [1]. The World Wide Web
provides anyone with both a computer and Internet connection access to a plethora of
information within the touch of a button: anything from encyclopedias to current news,
pictures to movies, basic chemistry to the construction of IEDs. In conjunction with this
dangerous information being easily accessible, the users and publishers have the potential
to remain anonymous. Complicating things further, terrorist organizations are exploiting
this resource by creating IED education networks via the World Wide Web to quickly
and efficiently propagate the information to their supporters and operatives.


One possible solution to this problem is an IED-specific WebCrawler. An IED
WebCrawler has the potential to quickly locate terrorist IED education networks via the
World Wide Web. Once found, these networks can be either shut down, monitored, or
infiltrated depending on the objectives of the government or agency employing the search
engine. By locating these networks, responsibility for particular attacks can be properly
assigned to specific terrorist networks, with particular IED countermeasures deployed to
prevent further loss of life and damage to property.
B. RESEARCH OBJECTIVES 


The research objectives of this thesis were to create a random network generator
capable of generating a random network to be used in testing the effectiveness of search
engine algorithms, while simultaneously developing a new search engine algorithm
aimed at identifying IED education networks accessible via the World Wide Web.
Additionally, this thesis will briefly mention how an IED WebCrawler could be modified
and used as a monitoring device, successfully tracking changes and updates to the IED
education networks.
C. THESIS ORGANIZATION 


This thesis consists of six chapters. The present chapter states an overview of the
problem, objectives, and thesis organization. Chapter II contains a brief description of
IEDs, retrieval strategies and a current survey of web crawling algorithms. Chapter III
describes the Nutch open-source search engine project. Chapter IV discusses the
development of a new search engine algorithm. Chapter V presents the subjective
performance measurements, compares different algorithms and determines relative
effectiveness. Chapter VI summarizes this thesis, draws conclusions and provides future
research recommendations.


II. BACKGROUND


A. THE IED THREAT 


1. Definition 


In 2008, the United States Department of Defense updated the definition of an
Improvised Explosive Device as:
a device placed or fabricated in an improvised manner incorporating
destructive, lethal, noxious, pyrotechnic, or incendiary chemicals and
designed to destroy, incapacitate, harass, or distract. [2]


Previously, an IED was only thought to incorporate military stores with non-military
components, but this concept is changing. Militaries around the world are
incorporating off-the-shelf commercial technology to lower production costs, blurring the
line between military and non-military components. What makes an IED special is the
fact that some part of the device, generally with regards to the triggering or delivery
mechanism, is altered from its original manufactured state to an "improvised" one.


The reason a standard IED definition is hard to agree upon is due to this fact:
IEDs are "improvised." For example, there are over 16 commonly used acronyms within
the U.S. military to describe different IEDs, with no real consensus on how they are
specifically classified: Chemical and Biological IED (CBIED), Command Detonated IED
(CDIED), Chemical IED (CIED), Command Wire IED (CWIED), Deep Buried IED
(DBIED), Explosively Formed Penetrator (EFP), House-Borne IED (HBIED), Home
Made Explosives (HME), Improvised Anti-Armor Grenade (IAAG), Person-Borne IED
(PBIED), Radio-Controlled IED (RCIED), Suicide IED (SIED), Suicide Vehicle-Borne
IED (SVBIED), Vehicle-Borne IED (VBIED), Victim Operated IED (VOIED), Water-Borne
IED (WBIED). Other examples include "sticky" and "flying" IEDs, specifically
referencing magnetic and rocket assisted mortars. Overall, there is no easy way to
classify all of the different potential types of IEDs.


2. Generic IED Composition


In general, an Improvised Explosive Device works by completing an explosive
train from start to finish. An explosive train is defined by the U.S. Department of
Defense as "a succession of initiating and igniting elements arranged to cause a charge to
function [2]." Figure 1 provides a generic line diagram of an IED explosive train. At the
beginning of the chain, a fuse is needed to initiate the reaction, with an accompanying
agent being the means of ignition. Fuse examples range greatly from a slow burning
piece of twine or cotton to a trail of black powder, etc., but all require some type of
ignition source to start the chain reaction. Next is the primer, which is a container that
holds the explosive agent. A detonator, also known as a blasting cap, is then used to
create a small explosion which will cause the main charge to ignite. Safety relays and
arming leads are usually incorporated in the device in order to prevent early detonation.
Booster charges are optional depending on the main charge composition. If the explosive
agent being used requires a large amount of energy to ignite its chemical agent, then a
booster charge will be required. Multiple booster charges can be used to create a cascade
effect if the main charge is in need of the extra energy.
Figure 1. Representation of a generic Explosive Train


Another way to look at IEDs is from an electrical point of view, provided in
Figure 2. Initially, a power source is needed to start the reaction. Power sources for such
devices range in various sizes, from a small 9V battery to a large car or truck battery.
Essentially, anything can be used as a power source, as long as it has the ability to store a
voltage potential and deliver enough current to initiate the explosive reaction. Next, an
optional arming switch can be incorporated in the device to prevent premature
detonation; otherwise a direct connection would be made. A trigger is then used to
complete the circuit, allowing the blasting cap to ignite the main charge.
Figure 2. Generic Improvised Explosive Device Electrical Diagram


3. Brief History of Use


Throughout all of mankind's history, many different groups of people have turned
to violent means in order to further a cause, whether through formal military measures or
small pockets of resistance against a common foe. In general, small groups with minimal
amounts of money were forced to become creative in order to effectively attack their
enemies, furthering their objectives. The first prominent example of IED use came in the
20th century during the Belarus "Rail War." In 1943, Belarusian partisans waged war
with IEDs against the German army, disrupting supply lines and destroying garrisons in
order to prevent their advance [3]. During the Vietnam War, Viet Cong soldiers used
numerous IEDs against American forces, causing approximately one third of all U.S.
casualties [4]. Since then, numerous separatist groups worldwide have adopted
their use, including groups located in areas such as Northern Ireland, Iraq, Afghanistan,
Israel, Lebanon and Chechnya.


As the war in Iraq comes to a close, and the U.S.-led war in Afghanistan rages on,
it has become clear that terrorist groups' weapon of choice is the IED. In response to the
high casualty rates in both locations, the United States created the Joint IED Defeat
Organization (JIEDDO) to combat the growing epidemic. Since its inception, JIEDDO
has effectively assisted in countering IED use, lowering the average number of IED
events Coalition forces encounter each month in Iraq and Afghanistan to approximately
900, down from a high of 2,800 in 2007 [5].
4. Current Concerns 


Unfortunately, with the advent of the World Wide Web, anyone with a computer
and Internet connection can find information on how to create an IED. For example, a
well-known anarchy book, The Jolly Roger's CookBook, can easily be found online
within minutes of a Google search involving terms related to IEDs: anarchy, bomb, and
explosive [6]. This detailed case in point illustrates just how vast the problem has
become. Terrorist networks are exploiting the Internet and creating vast IED education
networks to further their cause.
B. INFORMATION RETRIEVAL 


The science of information retrieval has come to the forefront of Internet research
within the last two decades. As more and more people use search engines to find
pertinent information, the need to properly classify relevant documents continues to grow
and evolve. One success story demonstrating such importance is Google. Its search
engine took into account more factors than any other, considering not just term
frequencies, but "whether words or phrases on web pages were close together or far apart,
what their font size was, whether they were capitalized or in lowercase type [7]."
Learning to evaluate what information is important is the first step in developing a
successful search algorithm. Different retrieval strategies and known
ranking algorithms are presented below.


1. Retrieval Strategies


a. Vector Space Model 


The vector space model is a retrieval strategy widely used in some of
today's most successful WebCrawlers. The model works by representing each document
as a vector in multiple dimensions, with the number of dimensions dependent on the
quantity of terms entered into the query. If a term is found to be in a document, the
corresponding component of that document's vector is non-zero. Similarity coefficients
(SCs) computed from these vectors are then compared to determine which documents are
the most relevant to a given input query. Specific calculations involving similarity
coefficients vary between WebCrawlers and are usually considered proprietary information.


A simple term-by-document matrix example is presented in Table 1 with a
document in each column and corresponding term in each row. The value indicated
represents the term's frequency within that document. In this specific case, term
frequency will be no more than one. For example, Term 3 appears in both Document 2
and Document 3 but not in the other example Documents. To further grasp this concept,
Figure 3 demonstrates what Table 1's term-by-document matrix looks like as a vector in
3-dimensional space. If term frequencies were actually considered in this example, an
additional normalizing factor would have to be applied to the matrix.

          Document 1   Document 2   Document 3   Document 4
Term 1        1            0            1            0
Term 2        0            0            1            1
Term 3        0            1            1            0
Table 1. Small term-by-document matrix (From [8]).
Figure 3. Representation of documents in a 3-dimensional vector space (From [8]).
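
To make the vector space idea concrete, the following minimal sketch ranks the four documents of Table 1 against a query. The cosine measure, the function names and the sample query are assumptions chosen for illustration; the source notes that real WebCrawlers use their own, usually proprietary, similarity calculations.

```python
import math

# Term-by-document matrix from Table 1: rows are terms, columns are documents.
TERM_DOC = [
    [1, 0, 1, 0],  # Term 1
    [0, 0, 1, 1],  # Term 2
    [0, 1, 1, 0],  # Term 3
]

def document_vector(matrix, doc_index):
    """Return one column of the matrix, i.e. the vector for one document."""
    return [row[doc_index] for row in matrix]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(matrix, query):
    """Return (document number, score) pairs, best match first."""
    n_docs = len(matrix[0])
    scores = [(d + 1, cosine_similarity(query, document_vector(matrix, d)))
              for d in range(n_docs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical query containing Term 1 and Term 3 but not Term 2.
ranking = rank_documents(TERM_DOC, [1, 0, 1])
```

With this query, Document 3 (which contains every term) ranks first and Document 4 (which contains neither query term) scores zero.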


In general, problems arise with this method due to the fact that the
frequency of terms does not always correlate to relevance, nor does the single inclusion
of a query term. The order in which terms appear is not taken into account either. Other
methods are used in conjunction with the vector space model to enhance the quality of
a WebCrawler's search results. Relevancy ranks vary among them and are solely
dependent on the ranking algorithm.
b. Language Model 


The language model is defined as a "probabilistic mechanism for
'generating' a piece of text" [9]. In other words, it generates a distribution for all the
possible word patterns and assigns a similarity coefficient based on the likelihood of a
document generating a query. Contextual information can be used as well to generate the
distribution for more complex algorithms. The difficulty involving this method is that a
model is built for each document, making the method extremely computationally
intensive.
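
A minimal sketch of the "likelihood of a document generating a query" is given below, assuming a unigram model with add-one (Laplace) smoothing. The smoothing choice, the function name and the toy corpus are illustrative assumptions and are not taken from [9].

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, doc_terms, vocabulary_size):
    """Log-probability that a unigram model of the document generates the query.

    Add-one smoothing is an assumption made here so that a query term missing
    from the document does not drive the probability to zero.
    """
    counts = Counter(doc_terms)
    total = len(doc_terms)
    log_prob = 0.0
    for term in query_terms:
        p = (counts[term] + 1) / (total + vocabulary_size)
        log_prob += math.log(p)
    return log_prob

# Hypothetical toy corpus; one model is built per document.
docs = {
    "doc_a": "web crawler follows links between web pages".split(),
    "doc_b": "search engine ranks pages by relevance".split(),
}
vocab = {term for terms in docs.values() for term in terms}
query = ["web", "crawler"]
scores = {name: query_log_likelihood(query, terms, len(vocab))
          for name, terms in docs.items()}
best = max(scores, key=scores.get)
```

Because a separate count table is needed for every document, the cost noted above grows with the size of the collection.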


c. Probabilistic Retrieval


Probabilistic retrieval has many variant forms but two fundamental
approaches that differ based on usage patterns and query terms. The first method
uses usage patterns to predict relevance while the other uses query information to
determine relevance. In [10], Fuhr shows how the probability that a document will be
relevant can be estimated for a particular term. Using a binary independence retrieval (BIR)
model, he specifically demonstrates that "optimal retrieval quality can be achieved under
certain assumptions."


Unfortunately, probabilistic models are not very practical as they must
work around two general assumptions: parameter estimation and independence.
Parameter estimation refers to obtaining the parameter estimates through the use of
training set data. Without an accurate data set, it is very difficult to properly estimate the
parameters, which equates directly to their relevance. Independence assumptions on the
other hand cause problems as well. For example, it is clear that the presence of the term
"big" increases the probability in the English language of the presence of the term "bang"
in reference to the "big bang" theory. The independence assumption is normally required
for the model to work, even though the assumption may not be very realistic.
d. Inference Networks 


Inference networks, also known as Bayesian networks, are networks that
take known relationships and "infer" other relationships from the information. By having
the ability to infer information from previous relationships, less computation is needed to
determine the probability that an event will occur or be relevant. The best known
example of an inference network being used to determine search engine results is
contained within Google's PageRank algorithm and will be discussed in more detail in
section B-2-e of this chapter.


e. Extended Boolean Retrieval 


Conventional Boolean retrieval does not work very well when calculating
relevance rankings, due to the fact that a document either contains the query
term or does not. This problem potentially allows for a lot of documents to be marked as
satisfying the input query, but not be relevant, and vice versa. Extended Boolean
retrieval adjusts this concept by applying weights to the terms entered in the query,
known as term weights. These weights allow for the creation of a vector, with the
distance being calculated out from the origin to determine relevance matching. Most
modern search engines incorporate extended Boolean retrieval within a part of their
ranking algorithm [9].
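
One common way to realize the weighted, distance-based matching described above is the p-norm formulation of extended Boolean retrieval. The sketch below assumes that formulation with p = 2 (Euclidean distance); the source does not state which variant it has in mind, and the function names are illustrative.

```python
def pnorm_or(weights, p=2.0):
    """Extended Boolean OR: normalized distance of the weights from the origin."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)

def pnorm_and(weights, p=2.0):
    """Extended Boolean AND: one minus the normalized distance from the ideal point (1, ..., 1)."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)

# A document that fully matches one of two query terms and misses the other.
partial_or = pnorm_or([1.0, 0.0])    # strict Boolean OR would give 1
partial_and = pnorm_and([1.0, 0.0])  # strict Boolean AND would give 0
```

Unlike conventional Boolean retrieval, a partial match receives an intermediate score, so documents can be ranked rather than simply accepted or rejected.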
f. Latent Semantic Indexing


Latent Semantic Indexing is a method recognizing that a single concept
can be described by using many different words. Attempting to match only one or a few
words with a particular concept will produce many false results. By applying this
knowledge, Singular Value Decomposition (SVD) is used to generate a similarity
coefficient, filtering out the noise and enabling documents with similar lexical semantics
to be located closer in multi-dimensional space.
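
For reference, the usual way SVD is applied in this setting can be written as follows. The symbols are assumptions introduced here: A is the term-by-document matrix, k is the number of singular values retained, and q is a query vector in term space.

```latex
% Rank-k approximation of the term-by-document matrix A,
% keeping only the k largest singular values.
A \approx A_k = U_k \, \Sigma_k \, V_k^{\mathsf{T}}

% A query vector q is projected into the same k-dimensional space
% and then compared (e.g. by cosine) with the document rows of V_k.
\hat{q} = \Sigma_k^{-1} \, U_k^{\mathsf{T}} \, q
```

Discarding the smaller singular values is what filters out the noise, so that documents using different words for the same concept end up near one another.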
g. Neural Networks 


Neural networks are sets of nodes, each carrying an importance value.
When calculating a value to associate with each node, all of the values from the incoming
nodes are used. A portion of or the entire node's value is then passed on through the links
going out from it and used to calculate those nodes' values. Training sets are needed to
properly modify the weights of the links, ensuring satisfactory importance value
calculations.
h. Fuzzy Set Retrieval 


Fuzzy set retrieval is a method in which membership in a set is not
based solely on having only elements that are in the set, but rather on applying a formula to
calculate the SC, or "degree of membership" [9]. Boolean retrieval, union, intersection
and complement operations are applied to determine the degree of membership. Another
application used within "fuzzy set" retrieval is a spell check function. This function
attempts to prevent false results based solely on misspelled pages, as well as allowing
misspelled pages to not be penalized within the query results when they are relevant to a
particular query.
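
The union, intersection and complement operations mentioned above can be sketched with the standard max, min and one-minus rules on degrees of membership. The document names and membership values below are hypothetical.

```python
def fuzzy_union(a, b):
    """Degree of membership in A OR B: the larger of the two degrees."""
    return {k: max(a.get(k, 0.0), b.get(k, 0.0)) for k in set(a) | set(b)}

def fuzzy_intersection(a, b):
    """Degree of membership in A AND B: the smaller of the two degrees."""
    return {k: min(a.get(k, 0.0), b.get(k, 0.0)) for k in set(a) | set(b)}

def fuzzy_complement(a):
    """Degree of membership in NOT A."""
    return {k: 1.0 - v for k, v in a.items()}

# Hypothetical degrees to which documents belong to two term sets.
crawler = {"d1": 0.8, "d2": 0.3}
index = {"d1": 0.4, "d3": 0.9}

either = fuzzy_union(crawler, index)
both = fuzzy_intersection(crawler, index)
not_crawler = fuzzy_complement(crawler)
```

A document absent from a set is treated here as having degree 0.0 in that set, which is an assumption of this sketch.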
2. WebCrawler Algorithms 


Developing an algorithm to search and properly classify topics throughout the
World Wide Web is a difficult task. Early search engines classified information based
solely on lexical similarity and frequency [13]. These methods include Breadth-first,
Best-first, Shark-search and Info-spiders. With the monolithic rise of Google and
subsequent publishing of its PageRank concept, hypertext link structure analysis became
the primary tool for Web semantics [7]. Since then, multiple methods have been created
using PageRank as their basis, a survey of which is presented within this section.
Google's current algorithm has not been published, as it is considered
proprietary information forming the basis of the company's business.
a. Breadth-first 


The Breadth-first Search (BFS) algorithm was one of the first and simplest
known crawling strategies to be used on the World Wide Web. Developed in 1994 [11],
it uses a First-in First-out (FIFO) queue method, crawling links in the order in which they
are found. This method uses a single seed, i.e., a web page, and continues crawling until
all links are exhausted. An illustration outlining the basic method is shown in Figure 4.
Figure 5 presents an example BFS tree diagram containing 15 links, with the numbers
representing the order in which each web page link is found and processed.
Figure 4. Breadth-first Crawler Outline.





Figure 5. Breadth-first Crawler Tree Diagram Example. 
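
The FIFO behavior described above can be sketched over a small in-memory link graph. The graph, the page names and the function name are illustrative assumptions; a real crawler would fetch and parse pages instead of reading a dictionary.

```python
from collections import deque

def breadth_first_crawl(links, seed):
    """Return pages in the order a FIFO-queue crawler would process them."""
    queue = deque([seed])   # frontier of discovered but unprocessed pages
    seen = {seed}           # pages already queued, to avoid revisiting
    order = []
    while queue:
        page = queue.popleft()          # first in, first out
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

# Hypothetical link graph: each key lists the pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": ["A"],
    "F": [],
}
order = breadth_first_crawl(LINKS, "A")
```

Starting from the single seed "A", every page at one link depth is processed before any page at the next depth, and the crawl stops when the queue is exhausted.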


b. Best-first 


The Best-first algorithm is a method that uses some type of estimation
criteria to determine which link to crawl first, given a group of links located on a web
page. The idea behind the Best-first algorithm is to efficiently navigate and download
relevant pages first, while preventing memory buffer overloads in the server conducting
the crawl. An outline of the Best-first Crawler is presented in Figure 6. According to
[12], the Uniform Resource Locator (URL) link's name is generally considered the best
measure for estimating relevance, given that the name relates to a specific product, device
or relevant field. Figure 7 presents an example of a Best-first Tree Diagram.
[Figure 6 diagram; recoverable labels: Web Page, Link Similarity Estimator, FIFO queue]

Figure 6. Best-first Crawler Outline.





Figure 7. Best-first Crawler Tree Diagram Example.


One example of a generic cosine SC formula used to discriminate relevant
web pages is provided below:

$$SC(Q,D) = \sum_{i=1}^{t} w_{qi} \times d_{di}$$    (2.1)

where Q is a query weight vector and D is a specific document vector, both of size t,
which is the total number of specific terms in the query. d_{di} is defined as the weight
of term i within the document. w_{qi} is the weight assigned to query term i, having
treated the query as a document itself. Essentially, this formula takes the anchor text
pointing to another web page as a document and compares it to the entered query. The
more frequently the terms from the entered query are found in the anchor text, the higher
the SC will become.
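
The ordering step of a Best-first crawler can be sketched by scoring each link's anchor text with Equation 2.1 and popping links from a priority queue. Using raw term counts as document weights, the function names and the example URLs are assumptions made for illustration.

```python
import heapq

def similarity_coefficient(query_weights, doc_weights):
    """Equation 2.1 as a sum of weight products over the query terms."""
    return sum(w * doc_weights.get(term, 0.0)
               for term, w in query_weights.items())

def anchor_weights(anchor_text):
    """Term frequencies of the anchor text, used here as document weights."""
    weights = {}
    for term in anchor_text.lower().split():
        weights[term] = weights.get(term, 0.0) + 1.0
    return weights

def best_first_order(links, query_weights):
    """Return link URLs ordered by descending SC of their anchor text."""
    heap = []
    for position, (url, anchor) in enumerate(links):
        score = similarity_coefficient(query_weights, anchor_weights(anchor))
        # heapq is a min-heap, so the score is negated; position breaks ties.
        heapq.heappush(heap, (-score, position, url))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

# Hypothetical query and links found on one page.
query = {"open": 1.0, "source": 1.0, "search": 1.0}
links = [
    ("http://example.org/a", "contact us"),
    ("http://example.org/b", "open source search engine"),
    ("http://example.org/c", "search tips"),
]
crawl_order = best_first_order(links, query)
```

The link whose anchor text shares the most terms with the query is fetched first, which is the behavior the paragraph above describes.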
c. Shark-search


The Shark-search algorithm is essentially a hybrid of the Best-first
method, using a more complicated function to evaluate relevant links [14]. Scores for
links are influenced by more factors than before, including the text surrounding links,
anchor text and an inherited score derived from the previous page. The value added to a
search engine by using the Shark-search algorithm is that link fetching relevance is
determined by using a continuously changing value function as opposed to a standard
binary function, allowing for a more refined search. Overall, this method saves
communication time by obtaining documents that are more likely to be relevant first,
leading to other documents that are more relevant later on. Figure 6, shown previously,
illustrates the algorithm as well.
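
The way these factors might be combined into one continuous score is sketched below. This is a simplified illustration loosely following published descriptions of Shark-search; the parameter names (decay, beta, gamma) and their values are assumptions, not figures taken from [14].

```python
def inherited_score(parent_relevance, parent_inherited, decay=0.5):
    """Score a link inherits from the page it was found on.

    A relevant parent passes on a decayed share of its own relevance;
    an irrelevant parent passes on a decayed share of what it inherited.
    """
    if parent_relevance > 0:
        return decay * parent_relevance
    return decay * parent_inherited

def neighborhood_score(anchor_score, context_score, beta=0.8):
    """Blend of the anchor-text score and the surrounding-text score."""
    return beta * anchor_score + (1.0 - beta) * context_score

def potential_score(inherited, neighborhood, gamma=0.5):
    """Final continuous score used to order links in the frontier."""
    return gamma * inherited + (1.0 - gamma) * neighborhood

# Hypothetical link found on a page whose relevance to the query is 0.6.
inherited = inherited_score(parent_relevance=0.6, parent_inherited=0.0)
neighborhood = neighborhood_score(anchor_score=0.5, context_score=0.25)
score = potential_score(inherited, neighborhood)
```

Because the result is a continuous value rather than a yes-or-no decision, links from promising pages are explored earlier even when their own anchor text is only a partial match.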
d. Info-spiders 


Info-spiders are defined as independent agents gathering information in
parallel over the World Wide Web. Generally speaking, each agent contains a list of key
words and evaluates a node or multiple nodes within a network (i.e., web pages within
the World Wide Web), looking for new nodes relative to the key words entered. These
agents "exhibit an intelligent behavior, being able to evaluate the relevance of the
document content with respect to the user's query, and to reason autonomously about
future actions that mimic the browsing habits of human users [15]." As the "Spiders"
progress to new nodes within a network, the amount of energy, or SC, is calculated.
Eventually, the value drops below a set threshold, ending the search down a particular
linked path. The cycle then repeats itself within different networks determined by the
user. An example of such a program found freely on the Internet is MySpiders [15].
Figure 8 is a standard Info-Spider architecture representation, starting and
ending the process with a user. To begin, a user enters the information environment,
inputting the key words to be searched for over the World Wide Web. Next, the program
fetches each page as a raw HTML document. After the document is retrieved, it is parsed
and saved in a compact format. Meanwhile, the document is weighted for the given key
words and its outgoing links processed to determine the likelihood of finding the relevant
key words within the next linked page. The process repeats until the energy or SC drops
below a set threshold, ending the search. Multiple "Spiders" or paths are taken
simultaneously in parallel to speed up the process. At the end of the process, a database
has been developed and indexed relative to the entered key words that can be accessed by
the user at his or her leisure.


[Figure 8 diagram; recoverable labels: word index, reproduction or death]

Figure 8. Info-Spider Architecture (From [15]).
e. PageRank


In 1998, Sergey Brin and Lawrence Page forever changed the way the
world searches for relevant web pages with the development of Google and the
subsequent implementation of the PageRank algorithm. According to [16], PageRank is
an algorithm that ranks a web page based solely on its incoming and outgoing hypertext
links. In general, pages with more incoming links are viewed as being more "important"
than those with fewer incoming links. The easiest way to envision the concept is as a
citation format. Each web page hypertext link is a citation or vote of approval for the
web page it points to, with the weight of the citation based on the number of votes of
"importance" the page receives. Equation 2.2 defines a slightly simplified PageRank
algorithm with R being the ranking, u a web page, F_u the set of pages u points to and
B_u the set of pages that point to u. The number of links from u is N_u = |F_u| and c is
a factor used to normalize all of the rankings.


$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$    [17] (2.2)


The equation is applied recursively until convergence is reached. Figure 9 presents a visual
example of such a simplified calculation reaching an approximate equilibrium. Initially,
page A is given a value of 1.0 for its ranking. Because page A has two outgoing links, its
value is divided in half so that pages B and C each receive a ranking of 0.5. With pages B
and C only having one outgoing link each, they both pass on their link's value to pages C
and A respectively. At this point, page A has a value of 0.5, page B a value of 0.0, and
page C a value of 0.5. The equation is applied recursively until equilibrium is reached,
with the results shown in Table 2.
Figure 9. Simplified PageRank Calculation (From [17]).

Recursion #   Page A   Page B   Page C
     1        1.0000   0.0000   0.0000
     2        0.0000   0.5000   0.5000
     3        0.5000   0.0000   0.5000
     4        0.5000   0.2500   0.2500
     5        0.2500   0.2500   0.5000
     6        0.5000   0.1250   0.3750
     7        0.3750   0.2500   0.3750
     8        0.3750   0.1875   0.4375
     9        0.4375   0.1875   0.3750
    10        0.3750   0.2188   0.4063
    11        0.4063   0.1875   0.4063
    12        0.4063   0.2031   0.3906
    13        0.3906   0.2031   0.4063
    14        0.4063   0.1953   0.3984
    15        0.3984   0.2031   0.3984

Table 2. PageRank Recursion Equation Calculations.
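
The values in Table 2 can be reproduced with a short sketch of Equation 2.2, assuming c = 1 and the three-page link structure described in the text (A links to B and C, B links to C, C links to A). The function and variable names are illustrative.

```python
def simplified_pagerank_step(ranks, links):
    """One pass of Equation 2.2 with c = 1.

    Each page splits its current rank evenly over its outgoing links.
    """
    new_ranks = {page: 0.0 for page in ranks}
    for page, targets in links.items():
        share = ranks[page] / len(targets)
        for target in targets:
            new_ranks[target] += share
    return new_ranks

# Link structure of the three-page example.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

ranks = {"A": 1.0, "B": 0.0, "C": 0.0}  # recursion 1 in Table 2
history = [ranks]
for _ in range(14):                      # recursions 2 through 15
    ranks = simplified_pagerank_step(ranks, LINKS)
    history.append(ranks)
```

Here history[0] corresponds to recursion 1 and history[14] to recursion 15. The total rank stays at 1.0 throughout, and the values drift toward roughly 0.4, 0.2 and 0.4 for pages A, B and C.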


Problems can arise with this particular ranking function due to a potential
issue known as "rank sink." Simply put, if any pages are fetched and point only to each
other, an infinite loop will occur, causing the web page ranks to increase, but never be
distributed. An illustration of such an event is given in Figure 10. To solve this problem,
a ranking source vector E(u) is introduced in Equation 2.3. The ranking source vector is
used as a source of rank to prevent rank sink. Intuitively, it "corresponds to the
distribution of web pages that a random surfer periodically jumps to," with E typically
equal to 0.15 [17]. R' therefore changes to become an assignment of PageRank to a set of
web pages.


$$R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u)$$    [17] (2.3)

Figure 10. Loop Which Acts as a Rank Sink (From [17]). 


The final PageRank formula is developed by going one step further and by
replacing c with a dampening factor d in Equation 2.2:


$$PR(u) = (1 - d) + d \sum_{v \in B(u)} \frac{PR(v)}{N_v}$$    [17] (2.4)


The dampening factor shown above is a simple means of directly manipulating the
PageRank. In general, it should be thought of as the probability that a user will follow
the links and (1 - d) as the scoring distribution from non-directly linked pages.
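
Equation 2.4 can be sketched as a fixed number of update passes over a small graph. The value d = 0.85, the iteration count, the starting value of 1.0 per page and the three-page graph are assumptions chosen for illustration; the sketch also assumes every page has at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate Equation 2.4: PR(u) = (1 - d) + d * sum(PR(v) / N_v)."""
    pages = list(links)
    # B(u): the pages that link to u.
    incoming = {page: [] for page in pages}
    for source, targets in links.items():
        for target in targets:
            incoming[target].append(source)
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        pr = {u: (1 - d) + d * sum(pr[v] / len(links[v]) for v in incoming[u])
              for u in pages}
    return pr

# Same three-page structure as the simplified example:
# A links to B and C, B links to C, C links to A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(LINKS)
```

In this form the scores sum to the number of pages rather than to 1, and page B, which receives only half of page A's vote, ends up with the lowest score.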
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One of the biggest issues mentioned by Brin and Page in their research are 
"dangling links" [17]. Dangling links are defi ned as any link that points to a page that 
has no outgoing links. Due to the fact that these links do not have an affect on the 
ranking, they are rem oved from the system and added back in after convergence of the 
PageRank algorithm. Normalization of the other links will change slightly but should not 


have a large effect on the total population of web pages. 
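
The recursion in Equation 2.4 can be illustrated with a minimal sketch. The function name, the dictionary-based graph format, the starting value of 1.0 per page, and the stopping tolerance are all assumptions made for illustration; they are not taken from [17] or from Nutch.

```python
def pagerank(out_links, d=0.85, tol=1e-9, max_iter=200):
    """Iterate PR(u) = (1 - d) + d * sum(PR(v) / N_v for v in B(u)).

    out_links maps each page to the list of pages it links to
    (hypothetical input format chosen for this sketch).
    """
    pages = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    # B(u): the pages that link to u.
    back = {u: [] for u in pages}
    for v, targets in out_links.items():
        for u in targets:
            back[u].append(v)
    # N_v: the number of out-links of v.
    n_out = {v: len(out_links.get(v, ())) for v in pages}
    pr = {u: 1.0 for u in pages}  # assumed starting value
    for _ in range(max_iter):
        new = {u: (1 - d) + d * sum(pr[v] / n_out[v] for v in back[u])
               for u in pages}
        if max(abs(new[u] - pr[u]) for u in pages) < tol:
            return new
        pr = new
    return pr


ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this three-page graph every page has at least one out-link, so the ranks sum to the number of pages. Dangling pages are simply left in place here rather than removed and re-added as described above.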
C. PAGERANK ALGORITHM VARIATIONS


Since publishing the generic PageRank algorithm, Google has moved forward to dominate the
World Wide Web search engine business. Microsoft Network, Yahoo!, Ask, and others still
exist and have maintained a significant amount of market share, but are nowhere close to
that of Google [7]. Google's actual algorithm and code, along with those of the other
companies mentioned above, are still proprietary. Listed below are other known algorithms
that attempt to improve upon Google's initial PageRank algorithm with their own variant.
1. Topic-sensitive


A "topic-sensitive," "topic-centric" or "focused" crawler is an algorithm that returns a
"local ranking based on each user's preferences as biased by a set of pages they trust or
topics they prefer" [18]. This approach differs from PageRank by taking advantage of
personalization, tailoring information specific to the search context. It also allows an
increase in information relevance at the cost of computational resources. To determine
relevance, a similarity score is initially calculated as previously shown in Equation 2.1.
This score determines the relevance of the current page and is used as a component to
determine the final link score. Equation 2.5 calculates the link score, Linkscore(j), by
adding together the URL score, URLscore(j), and the anchor text score, Anchorscore(j)
[19]. Linkscore(j) is the score of the hypertext link j; URLscore(j) is the similarity
between the current page's hypertext link information of j and the topic specified; and
Anchorscore(j) is the similarity between the anchor text and the topic specified.

Linkscore(j) = URLscore(j) + Anchorscore(j)    (2.5)

After the link score is determined, a final score for the link is calculated by combining
the current page's similarity score with the previously calculated link score. Equation
2.6 calculates the final score, Score_To_PR(j), by adding TP(j) to Linkscore(j) [19].
Score_To_PR(j) is defined as the final score of the Topic-PageRank algorithm with respect
to link j; TP(j) is the Topic Page similarity score; and Linkscore(j) is the score of the
link previously calculated in Equation 2.5.

Score_To_PR(j) = TP(j) + Linkscore(j)    (2.6)


Experiments to determine the performance of the above algorithm were conducted by Yuan,
Yin, and Liu [20]. Accordingly, a metric called the "harvest ratio" was devised to
quantify performance. Equation 2.7 defines the harvest ratio as the number of relevant
pages divided by the total number of downloaded pages. The topics searched for in this
experiment were American History, New Car, China travel and huang shan travel, with the
corresponding results shown in Table 3. Overall, Breadth-first had the worst ranking
values, with an average of 0.3375 and the largest variation in value. PageRank performed
better, with an average ranking value of 0.4625 and the least variation in value.
T-PageRank performed the best, with an average ranking value of 0.6225 and only slight
variations in value.


$$\text{Harvest\_Ratio} = \frac{\#\ \text{of Relevant Pages}}{\#\ \text{of Downloaded Pages}} \qquad (2.7)$$
Topic             | Language | Breadth-first | PageRank | T-PageRank
American History  | English  | 0.34          | 0.47     | 0.64
New Car           | English  | 0.34          | 0.47     | 0.65
China travel      | Chinese  | 0.29          | 0.46     | 0.59
huang shan travel | Chinese  | 0.38          | 0.45     | 0.61























Table 3. Harvest Rate of Topics (From [20]).

As shown in Table 3, the topic-sensitive algorithm was more effective at providing
relevant results when compared to the breadth-first and PageRank algorithms. In a
different experiment, according to [18], approximately 70 percent of the pages returned
were the same between a topic-sensitive crawler and Google's Global PageRank. This overlap
arises because, as more pages are crawled, the results begin to converge. Additionally,
seed URLs determine where the search engines look next. If they are the same, the results
will be similar.
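
As a quick check of the figures quoted for Table 3, the harvest ratio of Equation 2.7 and the per-algorithm averages can be recomputed with a short sketch. The function and variable names are chosen here for illustration; the numbers are copied from Table 3.

```python
def harvest_ratio(relevant_pages, downloaded_pages):
    """Equation 2.7: relevant pages divided by downloaded pages."""
    if downloaded_pages <= 0:
        raise ValueError("at least one page must have been downloaded")
    return relevant_pages / downloaded_pages


# Harvest ratios per topic, in the row order of Table 3.
table3 = {
    "Breadth-first": [0.34, 0.34, 0.29, 0.38],
    "PageRank": [0.47, 0.47, 0.46, 0.45],
    "T-PageRank": [0.64, 0.65, 0.59, 0.61],
}

averages = {name: sum(vals) / len(vals) for name, vals in table3.items()}
spreads = {name: max(vals) - min(vals) for name, vals in table3.items()}
```

The averages come out to 0.3375, 0.4625 and 0.6225 (up to floating-point rounding), matching the values stated in the text, and the spread between the best and worst topic is largest for Breadth-first and smallest for PageRank.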
2. Weighted


The Weighted PageRank (WPR) algorithm is an extension of the original PageRank algorithm,
taking into account the importance of both the in-links and out-links by "distributing
rank scores based on the popularity of the pages" [21]. Simply put, the algorithm assigns
larger rank values to pages that are more popular instead of dividing the rank value of a
page evenly among its out-links. Equation 2.8 calculates the weighted popularity of the
in-links as W^in_(v,u). This is "based on the number of in-links of page u and the number
of in-links of all reference pages of page v" [21]. I_u and I_p represent the number of
in-links of pages u and p respectively. R(v) is the reference pages list of page v.


$$W^{in}_{(v,u)} = \frac{I_u}{\sum_{p \in R(v)} I_p} \qquad (2.8)$$


Accordingly, the out-links are calculated in a similar way, using Equation 2.9.
W^out_(v,u) is the weighted popularity of the out-links. This is based on the number of
out-links of page u and the number of out-links of all reference pages of page v. O_u and
O_p represent the number of out-links of pages u and p respectively. R(v) is the reference
pages list of page v.

$$W^{out}_{(v,u)} = \frac{O_u}{\sum_{p \in R(v)} O_p} \qquad (2.9)$$


Knowing the above information, the final PageRank formula, Equation 2.4, is then modified
to:

$$PR(u) = (1-d) + d \sum_{v \in B(u)} PR(v)\, W^{in}_{(v,u)}\, W^{out}_{(v,u)} \qquad (2.10)$$


Testing for the Weighted PageRank algorithm was done using the query "scholarship" in
[21]. Table 4 presents the size of the page set obtained, the number of relevant pages and
the relevancy value for the given pages. In general, WPR is shown to have higher relevancy
values for the relevant pages found, but it still finds approximately the same number of
relevant pages as the original PageRank algorithm.
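
To make Equations 2.8 through 2.10 concrete, the following is a minimal sketch of the link weights and the resulting iteration. It assumes R(v) is the set of distinct pages that v links to, assigns an out-link weight of 0 when none of v's reference pages has any out-links, and uses a fixed iteration count; these choices, and all function names, are illustrative assumptions rather than details taken from [21].

```python
def link_weights(out_links):
    """Return (W_in, W_out) keyed by (v, u) for every link v -> u."""
    pages = set(out_links) | {u for ts in out_links.values() for u in ts}
    refs = {v: sorted(set(out_links.get(v, ()))) for v in pages}  # R(v)
    in_count = {p: 0 for p in pages}                              # I_p
    for v in pages:
        for u in refs[v]:
            in_count[u] += 1
    out_count = {p: len(refs[p]) for p in pages}                  # O_p
    w_in, w_out = {}, {}
    for v in pages:
        in_total = sum(in_count[p] for p in refs[v])
        out_total = sum(out_count[p] for p in refs[v])
        for u in refs[v]:
            w_in[(v, u)] = in_count[u] / in_total                 # Eq. 2.8
            # Assumption: weight 0 if no reference page of v has out-links.
            w_out[(v, u)] = out_count[u] / out_total if out_total else 0.0
    return w_in, w_out


def weighted_pagerank(out_links, d=0.85, iterations=100):
    """Iterate Equation 2.10 a fixed number of times."""
    w_in, w_out = link_weights(out_links)
    pages = set(out_links) | {u for ts in out_links.values() for u in ts}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new = {p: 1.0 - d for p in pages}
        for (v, u), weight in w_in.items():
            new[u] += d * pr[v] * weight * w_out[(v, u)]
        pr = new
    return pr


graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
w_in, w_out = link_weights(graph)
wpr = weighted_pagerank(graph)
```

In this small graph page C has two in-links and page B has one, so the link from A passes two thirds of its in-link weight to C and one third to B, which is the popularity bias described above.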


Size of Page Set | Number of Relevant Pages (WPR, PageRank) | Relevancy Value κ (WPR, PageRank)
[table values not recoverable from the source scan]


Table 4. "scholarship" Query Results (From [21]).





3. Usage-based 


According to [22], Usage-based PageRank (UPR) is a modification of the original PageRank
algorithm in that it additionally ranks web pages based on previous users' navigation
behavior. The computation is essentially biased using the information from previous
users' visits that are recorded in the website's log. To do this, a transition matrix m
and personalization vector p are both defined in such a way that the pages and paths
previously visited by other users are ranked higher.


Following the properties of Markov theory and the PageRank algorithm, the Usage-based
PageRank vector, UPR, is calculated as follows:

$$UPR = (1-\varepsilon)\, m \times UPR + \varepsilon\, PER \qquad (2.11)$$

where ε is the dampening factor and m is an N x N transition matrix whose elements m_ij
equal 0 if there does not exist a link from page p_i to p_j. Otherwise, m_ij is defined
in Equation 2.12, with the personalization vector PER provided in Equation 2.13.

$$m_{ij} = \frac{w_{i \to j}}{\sum_{p_k \in OUT(p_i)} w_{i \to k}} \qquad (2.12)$$

$$PER = \left[ \frac{w_i}{\sum_{p_j \in WS} w_j} \right]_{N \times 1} \qquad (2.13)$$

The weight w_i for each node represents the number of times page p_i was visited and the
weight w_{j→i} on each edge represents the number of times p_i was visited after p_j.
These equations, when combined, result in the final UPR equation given in Equation 2.14,
which was represented previously by Equation 2.11.


$$UPR^{n}(p_i) = \varepsilon \left[ \sum_{p_j \in IN(p_i)} UPR^{n-1}(p_j) \frac{w_{j \to i}}{\sum_{p_k \in OUT(p_j)} w_{j \to k}} \right] + (1-\varepsilon) \frac{w_i}{\sum_{p_j \in WS} w_j} \qquad (2.14)$$

In [22], testing for the algorithm was limited, using publicly available data from
msnbc.com. Comparisons were made showing that UPR performed better than the other two
methods in prediction accuracy. To its advantage, the process of ranking the next
possible pages took less than 2 seconds and could be done online without delaying
navigation [22].
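
The iteration in Equation 2.14 can be sketched as follows. The input format (a dictionary of page visit counts and a dictionary of transition counts), the default ε of 0.85, the uniform starting vector, and the fixed iteration count are assumptions made for this illustration and are not taken from [22]. All transition counts are assumed to be positive and to refer only to pages present in the visit dictionary.

```python
def usage_pagerank(visits, transitions, eps=0.85, iterations=100):
    """Iterate Equation 2.14.

    visits:      {page: w_i}, number of times each page was visited
    transitions: {(p_j, p_i): w_j_to_i}, times p_i was visited after p_j
    """
    pages = sorted(visits)
    total_visits = sum(visits.values())
    # Denominator of the link term: total usage weight leaving each page.
    out_total = {p: 0.0 for p in pages}
    for (src, _dst), weight in transitions.items():
        out_total[src] += weight
    upr = {p: 1.0 / len(pages) for p in pages}  # assumed starting vector
    for _ in range(iterations):
        new = {}
        for p_i in pages:
            link_term = sum(upr[src] * weight / out_total[src]
                            for (src, dst), weight in transitions.items()
                            if dst == p_i)
            personal = visits[p_i] / total_visits
            new[p_i] = eps * link_term + (1 - eps) * personal
        upr = new
    return upr


visits = {"a": 5, "b": 3, "c": 2}
transitions = {("a", "b"): 3, ("a", "c"): 2, ("b", "a"): 3, ("c", "a"): 2}
upr = usage_pagerank(visits, transitions)
```

Because every page in this toy log has at least one outgoing transition, the scores remain a probability distribution, and the most visited page "a" receives the highest rank.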
4. TimeRank 


TimeRank is another variant of PageRank in that it uses the web page's record of the last
visited time to determine its degree of importance [23]. Essentially, it uses a time
factor to improve upon the precision of a given ranking, basing it on the amount of time a
user stays on the website. The longer the time logged, the more important the page.
TimeRank is calculated by Equation 2.15 [23]. TR(j) is the final calculated score;
Score_To_PR(j) is the same score calculated from Equation 2.6's Topic-Sensitive algorithm;
and t(i) is the total visiting time of a page related to a topic. t(i) is initially set at
1 to avoid a zero ranking of a relevant topic web page.

TR(j) = Score_To_PR(j) * t(i)    (2.15)


Unfortunately, some complications arise with the algorithm due to processing server logs.
A rule regarding the use of web proxies is applied to determine a valid source IP. If the
source IP is the same within 30 minutes, it is treated as one user; otherwise it is
discarded. Another issue not discussed is the fact that a page could be long and contain a
lot of information that the reader must sift through. If this is the case, a page may be
related to the general topic entered, but not the specific topic searched for, and still
have a higher score due to the t(i) factor.
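
Equations 2.5, 2.6 and 2.15 chain together, which a small sketch makes explicit. The similarity inputs are plain numbers here (the similarity measure itself is defined in Equation 2.1 and not reproduced), and the function names and the sample values are hypothetical.

```python
def link_score(url_score, anchor_score):
    """Equation 2.5: URL similarity plus anchor-text similarity."""
    return url_score + anchor_score


def score_to_pr(topic_page_score, url_score, anchor_score):
    """Equation 2.6: topic-page similarity plus the link score."""
    return topic_page_score + link_score(url_score, anchor_score)


def time_rank(topic_score, visit_time=1.0):
    """Equation 2.15: the topic score scaled by total visiting time.

    visit_time defaults to 1, mirroring the initial value of t(i),
    so an unvisited but relevant page keeps its topic score.
    """
    return topic_score * visit_time


base = score_to_pr(0.5, 0.25, 0.25)      # hypothetical similarity values
unvisited = time_rank(base)              # t(i) left at its initial value
well_read = time_rank(base, visit_time=3.0)
```

The last line also illustrates the concern raised above: a long page that merely holds readers for three time units triples its score regardless of how specific its content is.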


5. DYNA-RANK


The final PageRank variant discussed is the DYNA-RANK algorithm. DYNA-RANK focuses on
"efficiently calculating and updating Google's PageRank vector using 'peer to peer'
systems" [24]. Changes in the web structure are handled incrementally amongst peers,
requiring less computation time and fewer iterations compared to a centralized approach.
The concept uses the fact that changes will only affect up to a certain domain, not
requiring a full recalculation of ranking vectors for others outside the domain.
The original PageRank formula is initially used when applying the DYNA-RANK algorithm.
Equation 2.16 defines new_weight(K,L), which is used to calculate the weights of all of
the out-links within the peer:

$$\text{new\_weight}(K, L) = \frac{P_r(K)}{n(K)_{PEER(i)} + 1} \qquad (2.16)$$

where new_weight(K,L) is the new edge weight calculated; P_r(K) is the PageRank value of
node K; and n(K)_PEER(i) is the number of out-links of node K on PEER(i). PEER(i) is
defined as a specific domain or peer grouping. To figure out which links need to be
updated, a relative change value, RC, is calculated according to Equation 2.17:

$$RC = \frac{\text{abs}(\text{new\_weight} - \text{old\_weight})}{\text{new\_weight}} \qquad (2.17)$$

where old_weight was the previously calculated new_weight(K,L).
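
Equations 2.16 and 2.17 translate directly into code. In the sketch below the update threshold is a hypothetical parameter added to show how RC might be used to select links; the source does not state how the threshold is chosen, and the function names are illustrative.

```python
def new_weight(pagerank_k, out_links_on_peer):
    """Equation 2.16: PageRank of node K over (out-links on the peer + 1)."""
    return pagerank_k / (out_links_on_peer + 1)


def relative_change(new_w, old_w):
    """Equation 2.17: |new - old| relative to the new weight."""
    return abs(new_w - old_w) / new_w


def links_to_update(old_weights, new_weights, threshold):
    """Return links whose relative change exceeds a (hypothetical) threshold."""
    return [link for link, new_w in new_weights.items()
            if relative_change(new_w, old_weights[link]) > threshold]


old = {("K", "L"): 0.20, ("K", "M"): 0.20}
new = {("K", "L"): 0.25, ("K", "M"): 0.201}
changed = links_to_update(old, new, threshold=0.05)
```

Only the link from K to L is selected, since its weight moved by 20 percent while the other moved by about half a percent. This is the mechanism that lets changes stay local to a peer.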


Overall, DYNA-RANK performs well in reducing the time to reach relative convergence as
well as the number of iterations required [24]. Future work is needed to evaluate how well
this algorithm would work in combination with a topic-sensitive PageRank algorithm.


Of the algorithms surveyed above for use in an IED Education Network WebCrawler, none
appears to be specifically tailored to, or easily capable of, discovering hidden networks
within the World Wide Web. In order to carry the research forward, a specific WebCrawler
must be chosen for future work and implementations, allowing an inside look at the current
algorithm being used by the WebCrawler. The criteria for choosing the WebCrawler were that
it must be free, open-source software that is scalable and easily deployed. Knowing this,
our choice for an IED Education Network WebCrawler was the Nutch project.

III. NUTCH


A. INTRODUCTION 


The Nutch project is a Java-based open-source search engine, capable of crawling a simple
intranet, a subset of the Internet, or the entire World Wide Web [25]. Prior to Nutch's
development, it was generally not possible to analyze why any random search from a popular
search engine would rank a generic web page y higher than web page x for a given query.
This was in part because most search engine algorithms are considered proprietary, and in
part to prevent spammers from manipulating text and links in order to boost a particular
website's rank. The Nutch project attempts to solve the algorithm dilemma by being
open-source. Its purpose is two-fold: to bring transparency and a detailed explanation of
how the score for a given web page or document is computed in a search engine, and to
provide an alternative search engine for people who are not fully satisfied with the
limited number of commercial Internet search engines in existence today. Additionally,
Nutch observes robot exclusion protocols to allow administrators the ability to control
which parts of their host are collected in this manner.
B. ARCHITECTURE 


The Nutch project's architecture is designed to be scalable in both search size and speed,
while implementing parallel retrieval techniques in the process. Its operation can be
divided into three parts: a crawler, an indexer and a search interface [25]. Figure 11
presents this conceptually from a high level design point of view. The crawler is designed
to search through any given file system, intranet, or the World Wide Web. This information
is then stored via a database named WebDB and cached for future use. In addition to
storage, the crawler uses Lucene to index the information retrieved. This index is then
used to retrieve the data from WebDB via a search interface.

[Figure 11 labels: Crawler, Lucene, Search Interface]

Figure 11. Nutch search engine high level design (From [25]).


The main advantage of using Nutch over other search engines is that the architecture is
scalable. Simply put, whether there is a need to index one domain or many, or even to
filter out others, it can handle them all. Nutch accomplishes this by using an extensible
markup language (XML) plug-in architecture that provides the user with the ability to make
modifications over a wide range of parameters without having to make any hard-coded
changes to the Java code. The Nutch default XML configuration file is contained in
Appendix A.
C. LUCENE


Lucene is at the heart of the Nutch search engine. Without it, the Nutch crawler would
only gather information, storing it into a database void of organization. According to
[26], Lucene is a mature, open-source Java program that provides indexing and searching
capabilities. It is not an application program, as many think, but a Java library that
does not make assumptions about what it indexes or searches. Essentially, Lucene can be
applied to search and index any type of file that can be converted into a recognizable
text format. Figure 12 illustrates this difference between Lucene and an external
application using it. Applications using Lucene present an interface that enables the user
to access Lucene's index while gathering different types of data at the same time,
completely dependent upon user input. Lucene differs from this by taking the data obtained
through an external application and bringing order to it through indexing. Overall, it
provides a means of searching the index generated in order to present the desired
information in an application.










[Figure 12 labels: Database, File System, Present, Search, Application, Lucene]


Figure 12. Typical application integration with Lucene (From [26]).


In addition to Lucene's ability to index documents, it has a transparent scoring algorithm
which sets it apart from other indexing programs. The formula used by Lucene to score
relevant documents d for a given query q is as follows:

$$\text{score}(q,d) = \sum_{t\ \text{in}\ q} tf(t\ \text{in}\ d) \cdot idf(t)^2 \cdot \text{boost}(t.\text{field in}\ d) \cdot \text{lengthNorm}(t.\text{field in}\ d) \qquad (3.1)$$


where tf(t in d) is the term frequency factor for the term t in document d, which allows
documents with a higher term frequency to obtain a higher score. idf(t) is the inverse
document frequency of the term, which allows documents that contain rare search query
terms to obtain a higher score. boost(t.field in d) is a user biasing boost value that can
be given to a document set during indexing for a specific t.field, being the term field in
document d. Finally, lengthNorm(t.field in d) is the normalization value of a field, given
the number of terms contained within the field, allowing a higher score to be assigned to
a field that is short and contains a searched query term. The field values discussed above
are provided via XML meta tag data, specifically url, anchor text, title, host and phrase.
Equation 3.1 can be expanded by multiplying the resulting score by coord(q,d) and
queryNorm(q). coord(q,d) is a coordination factor, a score based on how many of the query
terms are found in the document, while queryNorm(q) is a normalizing factor used to make
scores comparable between queries. In Nutch, the formula changes slightly by multiplying
the resulting score, score(q,d), by an Overall_Boost(d) value, as shown below:


Overall_Score(q,d) = Overall_Boost(d) · coord(q,d) · queryNorm(q) · score(q,d)    (3.2)


where Overall_Boost(d) is a boost factor determined by Nutch's page ranking algorithm for
document d and Overall_Score(q,d) is the final score of document d for a given query q. An
example calculation for Equations 3.1 and 3.2 is contained in Appendix B.
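
The structure of Equations 3.1 and 3.2 can be seen in a short sketch that takes the individual factors as given numbers. This does not call Lucene or Nutch; the function names, the per-term dictionary layout and every factor value below are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def lucene_score(term_factors):
    """Equation 3.1: sum over query terms of tf * idf^2 * boost * lengthNorm."""
    return sum(f["tf"] * f["idf"] ** 2 * f["boost"] * f["length_norm"]
               for f in term_factors)


def overall_score(term_factors, overall_boost, coord, query_norm):
    """Equation 3.2: page-level boost, coord and queryNorm times Eq. 3.1."""
    return overall_boost * coord * query_norm * lucene_score(term_factors)


# Two query terms matched in one document (illustrative factor values).
factors = [
    {"tf": 2.0, "idf": 1.5, "boost": 1.0, "length_norm": 0.5},
    {"tf": 1.0, "idf": 2.0, "boost": 2.0, "length_norm": 0.25},
]
base = lucene_score(factors)
final = overall_score(factors, overall_boost=2.0, coord=1.0, query_norm=0.5)
```

The first term contributes 2.0 × 1.5² × 1.0 × 0.5 = 2.25 and the second 1.0 × 2.0² × 2.0 × 0.25 = 2.0. The page-level boost then scales the whole sum, which is why changing it affects every query equally.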
D. ADAPTIVE OPIC 


Nutch is one of the few WebCrawlers to implement the Adaptive On-Line Page Importance
Computation, better known as OPIC. Developed in 2003, the algorithm is computed on-line
during fetch sequences in order to "focus crawling to the most interesting pages" [27].
The advantage OPIC has over other algorithms is that it does not use a lot of CPU or disk
resources; specifically, unlike PageRank, it does not need to store the actual link
matrix. Essentially, this algorithm can be thought of as a "non-iterative weighted
backlink-count strategy," where the ranking value is split evenly among its outgoing
links, producing a type of greedy algorithm [28].

Nutch implements OPIC by injecting the root node with a specific amount of value, or
"cash" as it is commonly referred to. The value injected is normally one unless otherwise
specified. When discussing cash values within Nutch, there are two specific types: current
and historical. Current cash is the amount of cash a document receives from incoming links
before or after processing. Typically, this value is the amount of cash it receives from
other documents' out-links having been processed, or else the amount it was injected with
to begin an initial webcrawl. Historical cash is the amount of cash a document has after
processing and after a search is complete. When a document is processed from the fetch
list, the cash is split evenly among the out-going links as shown below:


$$\text{Outlink\_Current\_Cash}(d) = \frac{\text{Current\_Cash}(d)}{\text{Num\_OutLinks}(d)} \qquad (3.3)$$

where Current_Cash(d) is the current cash value of document d being processed and
Num_OutLinks(d) is the number of links coming out from document d. These newly discovered
out-links are then added to the web link database, as well as the fetch list database for
future processing. Within the fetch list database, the Outlink_Current_Cash(d) value is
also stored and used as a measure to determine which node is processed next. In general,
the search turns into a breadth-first variant where nodes for a specific depth level are
not searched in the order found, but rather by their current cash score.


After a WebCrawler search is complete, the final value stored in historical cash is the
actual OPIC score for a document, OPIC_Score(d), defined as:

OPIC_Score(d) = Current_Cash(d) + Historical_Cash(d)    (3.4)

where Current_Cash(d) is the accumulated current cash of document d at the end of a search
and Historical_Cash(d) is the historical cash value of document d, determined at fetch
processing time. This factor affects the final score ranking of a document via the overall
boost factor found in Equation 3.2, with the Overall_Boost(d) defined as:

$$\text{Overall\_Boost}(d) = \sqrt{\text{OPIC\_Score}(d)} \qquad (3.5)$$


Some discussions have taken place in online blogs about why the square root value of the
OPIC score is used instead of the straight score or a logarithmic value. Doug Cutting, the
creator of both Nutch and Lucene, stated in many of them that the overall boost value was
calculated this way to prevent the OPIC score from overly influencing document ranking.
Either way, a logarithmic function and a square root function are both increasing
functions that compress large values and can manipulate the score in a similar fashion.
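
The cash bookkeeping described by Equations 3.3 through 3.5 can be sketched as follows. This is not Nutch's implementation: the dictionaries, the function names, and the choice to move a document's current cash into its history when it is processed are assumptions made for illustration, following the usual description of OPIC.

```python
import math


def outlink_current_cash(current_cash, num_outlinks):
    """Equation 3.3: cash handed to each out-link of the processed document."""
    return current_cash / num_outlinks


def process_document(doc, out_links, current, historical):
    """Assumed bookkeeping: bank the document's cash, then split it evenly."""
    cash = current.get(doc, 0.0)
    historical[doc] = historical.get(doc, 0.0) + cash
    current[doc] = 0.0
    targets = out_links.get(doc, [])
    if targets:  # a document without out-links passes nothing on
        share = outlink_current_cash(cash, len(targets))
        for target in targets:
            current[target] = current.get(target, 0.0) + share


def opic_score(doc, current, historical):
    """Equation 3.4: current cash plus historical cash."""
    return current.get(doc, 0.0) + historical.get(doc, 0.0)


def overall_boost(score):
    """Equation 3.5: square root of the OPIC score."""
    return math.sqrt(score)


out_links = {"root": ["a", "b"], "a": ["root"]}
current, historical = {"root": 1.0}, {}   # root injected with cash of one
process_document("root", out_links, current, historical)
process_document("a", out_links, current, historical)
root_boost = overall_boost(opic_score("root", current, historical))
```

After both steps the root holds 1.0 of historical cash plus 0.5 of current cash returned by "a", so its boost is the square root of 1.5. A fetch list ordered by current cash would process "b" next.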


Knowing the above information, a new algorithm can now be developed specifically for IED
Education Networks based solely on influencing the OPIC score of Nutch without affecting
Lucene's scoring factors, which are based on query terms.

IV. ALGORITHM DEVELOPMENT


A. PROBLEM DEFINITION 


When conducting any search over the World Wide Web, the results are only as good as the
algorithm linking the database together and the scoring equation used to filter out
unwanted documents via content. Initially, this thesis focused on changing the weighted
plug-in boost values of the five fields used to score a document, those being url, anchor
text, title, host and phrase. These values are calculated at query time and have a mild
effect on the final scoring of a document, but are ultimately shaped by the OPIC value
calculated during the fetch sequence. IED education networks can easily vary their
meta-tag data depending on how visible they would like their information to be.


The Nutch OPIC algorithm assumes that all out-going links are equal. In reality, not all
links are created equal. To fix this, we chose to change the OPIC algorithm in order to
assign a higher OPIC value to the pages which are referred to more, thereby ensuring web
pages with more significant importance are ranked accordingly. This will in turn allow an
IED-focused WebCrawler to appropriately weigh potential root node documents higher,
thereby making it easier to discover IED education networks.
B. ASSUMPTIONS 


While attempting to develop a new algorithm, it must first be assumed that the networks
being searched are truly random. IED education networks come in all shapes and sizes and
can easily range from just a single web page describing how to make one, to hundreds of
web pages with similar information passed among them. Second, all depth levels are
considered equal. The reason for this is to have a basis of comparison within a web
search. In addition, it is assumed that the education networks being sought are trying to
stay hidden within their respective domains and will not be easily located by their domain
name, such as www.HowToMakeIEDs.com.

C. NEW ALGORITHM


Given the above criteria and assumptions, the new algorithm developed takes into account
the fact that there exist four types of links coming out of a document: self referral
links, external domain links, new document links within the domain, and previously
discovered document links, either external or internal to the domain. Identification of
these types of links is critical in properly influencing the value of the OPIC score being
given to those documents. Knowing this, the following algorithm was developed, where the
current cash value or portion a node receives, Cash_Portion(d), is equal to:

$$\text{Cash\_Portion}(d) = \frac{\text{Current\_Cash}(d)}{S(d) \cdot Swgt + N(d) \cdot Nwgt + O(d) \cdot Owgt + E(d) \cdot Ewgt} \qquad (4.1)$$

where Current_Cash(d) is the current amount of cash contained within document d, S(d) is
the number of self referral links leaving the document, Swgt is the weight assigned to
self referral links, N(d) is the number of new document referrals, Nwgt is the weight
assigned to new document referrals, O(d) is the number of previously discovered document
referrals, Owgt is the weight assigned to previously discovered document referrals, E(d)
is the number of external link referrals and Ewgt is the weight assigned to external link
referrals.

For example, suppose a document with a current cash value of 1.0 is selected as the next
document to be processed from the fetch list database. During processing, it is discovered
that the document has 8 out-going links: 2 of the 8 links are self referral links, 4 links
are new links (one of them external) and the last 2 out-going links point to previously
discovered documents. The weights for the different types of links are all equal to 1,
simulating the weighting effect of the original OPIC score. Given this information,
applying Equation 4.1 divides the cash of 1.0 by 8, resulting in a Cash_Portion(d) for
each out-going document link equal to 0.125.
Following the logic given above, the OPIC current cash value for each out-going link is
calculated as:

Actual_Cash_Portion(d) = Cash_Portion(d) · Assigned_Wgt    (4.2)

where Actual_Cash_Portion(d) is the portion of document d's current OPIC cash value being
given to a specific out-going link of type S(d), N(d), O(d) or E(d). Cash_Portion(d) is
the value obtained from Equation 4.1 and Assigned_Wgt is the weight previously assigned to
the type of document link being processed, which can be Swgt, Nwgt, Owgt or Ewgt.
Continuing the previous example, the Actual_Cash_Portion(d) from Equation 4.2 would be
equal to the Cash_Portion(d) calculated from Equation 4.1 because the weight for each
out-going link is equal to 1.

Now, consider the same document given in the previous example with the following weighted
scores: Swgt equal to 1, Nwgt equal to 1, Owgt equal to 2 and Ewgt equal to 1. The
denominator of Equation 4.1 grows from 8 to 10, so the Cash_Portion(d) for each of the
out-going document links decreases to 0.1. This is significantly less than the amount
previously calculated. The Actual_Cash_Portion(d) is then calculated to be 0.1 for all of
the outgoing links except for the previously discovered links, which are each now equal to
0.2. This value is significantly higher than the previously determined value of 0.125,
showing that these nodes are of greater significance within the overall web link graph, as
shown in Table 5.
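
The two worked examples above can be reproduced with a minimal sketch of Equations 4.1 and 4.2. The function names and dictionary keys are chosen for illustration, the current cash is assumed to be 1.0 (the value for which Equation 4.1 yields the quoted 0.125 and 0.1), and returning 0.0 when the weighted link count is zero is an assumption rather than behavior described in the text.

```python
def cash_portion(current_cash, counts, weights):
    """Equation 4.1: cash divided by the weighted number of out-links."""
    denominator = sum(counts[kind] * weights[kind] for kind in counts)
    if denominator == 0:
        return 0.0  # assumption: nothing is passed on if every weight is 0
    return current_cash / denominator


def actual_cash_portions(current_cash, counts, weights):
    """Equation 4.2: per-link cash for each link type."""
    base = cash_portion(current_cash, counts, weights)
    return {kind: base * weights[kind] for kind in counts}


# 2 self referral, 3 new internal, 2 previously discovered, 1 external link.
COUNTS = {"S": 2, "N": 3, "O": 2, "E": 1}
ORIGINAL = {"S": 1, "N": 1, "O": 1, "E": 1}   # original OPIC behavior
WEIGHTED = {"S": 1, "N": 1, "O": 2, "E": 1}   # favors rediscovered documents

even_split = actual_cash_portions(1.0, COUNTS, ORIGINAL)
biased_split = actual_cash_portions(1.0, COUNTS, WEIGHTED)
distributed = sum(COUNTS[k] * biased_split[k] for k in COUNTS)
```

With unit weights every link receives 0.125; with Owgt set to 2 the previously discovered links receive 0.2 each and all others 0.1. The total handed out still equals the document's cash, so the weighting redistributes cash rather than creating it.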


























Link | Type          | OPIC Score | New Algorithm Score | Difference | % Change
1    | Self Referral | 0.125      | 0.1                 | -0.025     | 0.2
2    | Self Referral | 0.125      | 0.1                 | -0.025     | 0.2
3    | New           | 0.125      | 0.1                 | -0.025     | 0.2
4    | New           | 0.125      | 0.1                 | -0.025     | 0.2
5    | New           | 0.125      | 0.1                 | -0.025     | 0.2
6    | New           | 0.125      | 0.1                 | -0.025     | 0.2
7    | Old           | 0.125      | 0.2                 | 0.075      | 0.6
8    | Old           | 0.125      | 0.2                 | 0.075      | 0.6


Table 5. Original OPIC versus New OPIC Scoring.

Having now developed a new algorithm capable of ranking documents with specific links
higher than others, testing was needed to formulate a true understanding of the
algorithm's potential and future use against IED Education Networks.

V. PERFORMANCE MEASUREMENTS


The goal of the testing performed below was to establish a preliminary means of judging
the effectiveness of the proposed algorithm's ability to score web pages when compared to
the original OPIC algorithm, independent of Nutch. MATLAB code was created to randomly
generate networks in order to perform an analysis given three different types of
simulations. Multiple simulations were conducted, with only three examples discussed
herein.


A. EXPERIMENTAL SETUP 


1. Hardware & Operating System Configurations 


The platform used to conduct the simulation was a single Dell XPS M1330 laptop 
personal computer. This machine had an Intel Core 2 Duo CPU T9300 at 2.5 GHz, with 
4 GB of RAM and a 185 GB hard disk. The operating system used was Microsoft 
Windows Vista with Service Pack 1. 


2. Simulation Configuration


The software used to conduct the random network simulation and algorithm calculations was
the MathWorks Matlab R2008a Windows program. Matlab is a private distribution program and
requires a license. No special toolboxes or functions outside the original program were
needed to perform the simulation. The software used to plot the resulting data was the
Microsoft Office Excel Windows program. Microsoft Excel is a private distribution program
and requires a license. No special toolboxes or functions outside the original program
were needed to plot the results.
B. BENCHMARKING 


Benchmarking is the process of characterizing a system as a whole or via its various parts
in order to understand the actual or potential performance. In this particular case, three
simulations were conducted, varying the random number of potential outgoing links. The
first case, simulation 1, contains a low complexity randomly generated network with the
maximum number of out-links equal to 5. The second case, simulation 2, is a medium
complexity randomly generated network with the maximum number of outgoing links equal to
7. The final case, simulation 3, is a high complexity randomly generated network with the
maximum number of out-links equal to 10. All simulations were generated using the document
link probabilities contained below in Table 6. The probabilities shown in Table 6 are not
based on any particular network, but were chosen to ensure that the random networks
generated will continue to propagate and have the ability to expand. Additionally, the
depth level for all simulations was selected to equal 5 in order to visually present the
results with clarity.





























Document Link                  | Probability | Type of Document
New Document Internal          | 0.45        | 1
New Document External          | 0.05        | 2
Self Referral Link             | 0.05        | 3
Previously Discovered Document | 0.45        | 4


Table 6. Probability of Creating Specific Document Links.
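
A sketch of how out-link types might be drawn from the probabilities in Table 6 is given below. It is written in Python rather than the MATLAB used for the thesis simulations, and the assumption that the number of out-links is uniform between 1 and the maximum is made here for illustration only; the source states just that the number is random with a fixed maximum.

```python
import random

# Type code -> (description, probability), copied from Table 6.
LINK_TYPES = {
    1: ("New Document Internal", 0.45),
    2: ("New Document External", 0.05),
    3: ("Self Referral Link", 0.05),
    4: ("Previously Discovered Document", 0.45),
}


def sample_outlink_types(max_outlinks, rng):
    """Draw a random number of out-links and a type code for each one."""
    # Assumption: the out-link count is uniform on 1..max_outlinks.
    count = rng.randint(1, max_outlinks)
    codes = list(LINK_TYPES)
    probabilities = [LINK_TYPES[code][1] for code in codes]
    return rng.choices(codes, weights=probabilities, k=count)


rng = random.Random(2008)          # fixed seed so a run can be repeated
outlinks = sample_outlink_types(5, rng)   # low complexity: at most 5 links
```

Each returned list corresponds to one row of the "Type of Outlink" column in the simulation tables, with unused positions there padded by zeros.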


All 3 simulations calculate the original Nutch 0.8.1 OPIC score and 4 variant scores. The
original Nutch OPIC is defined in Equation 4.1 as Swgt, Nwgt, Owgt and Ewgt all equal to
1. Variant 1 is defined as Swgt, Nwgt and Ewgt equal to 1 while Owgt is equal to 2.
Variant 2 is defined as Swgt, Nwgt and Ewgt equal to 1 while Owgt is equal to 4. Variants
3 and 4 are respectively similar to variants 1 and 2 with the exception of Swgt being
equal to 0. The reason for using the 4 different variants was to determine if there is any
benefit to becoming extremely "greedy" with the algorithm and also to evaluate the effect
of removing self referral links from the networks.


Variation for a particular document d is calculated as:

\[ \mathrm{Variation}(d) = \mathrm{Final\_Cash}(d) - \mathrm{Level\_AVG\_Cash} \tag{5.1} \]

where Final_Cash(d) is the final cash value of document d and Level_AVG_Cash is
the average cash value for the document's level. Following this logic, the percentage
variation of document d is calculated as:

\[ \%\,\mathrm{Variation}(d) = \frac{\mathrm{Variation}(d)}{\mathrm{Level\_AVG\_Cash}} \tag{5.2} \]
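
Equations 5.1 and 5.2 can be illustrated with a short Python sketch. The function name
level_variations and the cash values in the example are made up for illustration; they
are not results from the simulations.

```python
from collections import defaultdict


def level_variations(final_cash, depth):
    """Return {doc: (variation, fractional variation)} following Eq. 5.1 and 5.2.

    final_cash maps document number -> final cash value.
    depth maps document number -> depth level.
    """
    by_level = defaultdict(list)
    for doc, cash in final_cash.items():
        by_level[depth[doc]].append(cash)

    result = {}
    for doc, cash in final_cash.items():
        values = by_level[depth[doc]]
        level_avg = sum(values) / len(values)      # Level_AVG_Cash
        variation = cash - level_avg               # Equation 5.1
        # Equation 5.2; guard against a level whose average cash is zero.
        fraction = variation / level_avg if level_avg else float("nan")
        result[doc] = (variation, fraction)
    return result


# Made-up values for two documents in depth level 2.
example = level_variations({2: 0.25, 3: 0.75}, {2: 2, 3: 2})
print(example)
```

Multiplying the second element of each pair by 100 gives the value as a percentage of
the level average.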


1. Low Complexity Network


The first type of random network examined is one of low complexity. Low
complexity is defined here as a network with fewer than 20 documents in its web-link
graph. Figure 13, shown below, is a visual representation of the network's web-link
structure. Figure 13 was constructed from Table 7, which contains the data generated
in Matlab to create the network. Column 1 displays the Document Number, which is the
number assigned to a document once a link to the document has been discovered and is
unrelated to processing order. Column 2 is the depth level the document was found in.
Each depth level is separated by a bold line for ease of viewing. Column 3 is an
external flag marker, with 0 equal to an internal document and 1 equal to an external
document. Column 4 is the number of outgoing links. This number is determined
randomly, with 5 links being the maximum number of out-links possible in this
simulation. Column 5 contains the type of each out-link for the number of outgoing
links given in column 4, determined using the probabilities given in Table 6. Column 6
displays the out-link document number corresponding to each link given in column 5.
Previously discovered document numbers are randomly determined from the documents in
the web-link graph at the time of discovery.
Figure 13. Simulation 1: Low Complexity Web Link Graph.






























































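Doc Num | Depth | Ext Flag | Num Outlinks | Type of Outlink | Outlink Doc Num
1       | 1     | 0        | 3            | 3 1 1 0 0       | 1 2 3 0 0
-----------------------------------------------------------------------------
2       | 2     | 0        | 3            | 4 4 1 0 0       | 3 1 4 0 0
3       | 2     | 0        | 4            | 1 4 1 1 0       | 5 4 6 7 0
-----------------------------------------------------------------------------
4       | 3     | 0        | 2            | 2 2 0 0 0       | 8 9 0 0 0
5       | 3     | 0        | 2            | 3 4 0 0 0       | 5 3 0 0 0
6       | 3     | 0        | 5            | 4 4 4 4 4       | 1 1 3 4 5
7       | 3     | 0        | 1            | 4 0 0 0 0       | 2 0 0 0 0
-----------------------------------------------------------------------------
8       | 4     | 1        | 3            | 1 1 1 0 0       | 10 11 12 0 0
9       | 4     | 1        | 2            | 1 4 0 0 0       | 13 12 0 0 0
-----------------------------------------------------------------------------
10      | 5     | 0        | 3            | 1 1 4 0 0       | 14 15 14 0 0
11      | 5     | 0        | 2            | 4 1 0 0 0       | 14 16 0 0 0
12      | 5     | 0        | 0            | 0 0 0 0 0       | 0 0 0 0 0
13      | 5     | 0        | 2            | 1 2 0 0 0       | 17 18 0 0 0
-----------------------------------------------------------------------------
14      | 6     | 0        | 0            | 0 0 0 0 0       | 0 0 0 0 0
15      | 6     | 0        | 0            | 0 0 0 0 0       | 0 0 0 0 0
16      | 6     | 0        | 0            | 0 0 0 0 0       | 0 0 0 0 0
17      | 6     | 0        | 0            | 0 0 0 0 0       | 0 0 0 0 0
18      | 6     | 1        | 0            | 0 0 0 0 0       | 0 0 0 0 0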















































Table 7. Simulation 1, Low Complexity Web Link Graph Data.
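
The out-link columns of Table 7 amount to an adjacency list. The sketch below, in
Python, transcribes the out-link document numbers as best the printed table can be read
and counts incoming links per document. The function name inlink_counts is hypothetical
and is used here only to show how the link structure can be inspected.

```python
from collections import Counter

# Document number -> out-link document numbers, transcribed from Table 7.
OUTLINKS = {
    1: [1, 2, 3],
    2: [3, 1, 4],
    3: [5, 4, 6, 7],
    4: [8, 9],
    5: [5, 3],
    6: [1, 1, 3, 4, 5],
    7: [2],
    8: [10, 11, 12],
    9: [13, 12],
    10: [14, 15, 14],
    11: [14, 16],
    12: [],
    13: [17, 18],
}


def inlink_counts(outlinks, include_self=True):
    """Count incoming links per document, optionally ignoring self referral links."""
    counts = Counter()
    for src, targets in outlinks.items():
        for dst in targets:
            if include_self or dst != src:
                counts[dst] += 1
    return counts


print(inlink_counts(OUTLINKS))
print(inlink_counts(OUTLINKS, include_self=False))
```

Comparing the two counts shows which documents, such as documents 1 and 5, are affected
when self referral links are removed.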




Evaluating simulation 1 is very straightforward. Figure 14, shown below,
provides an overview of the OPIC score trend, with random spikes representing
documents with a higher importance. Depth level 2 document comparisons, contained in
Figure 15, demonstrate a significant change in the OPIC scores, but the changes mirror
the original OPIC trend. Variant algorithms 3 and 4 continue the trends found in
variants 1 and 2, with the increase in score attributed to the removal of document 1's
self referral link. Variations with respect to the average cash value within depth
level 2 are presented in Figure 16, with Figure 17 showing them as a percentage of the
average cash value in the level for a given variant. Both of these figures show that
the OPIC score for document 2 drops in proportion to any gain in OPIC score by
document 3. This is to be expected, as document 2 gives more cash to document 3 based
on the network's link structure.











[Figure 14 plot: OPIC cash value versus document number for the Original, Variant 1,
Variant 2, Variant 3 and Variant 4 algorithms.]








Figure 14. Simulation 1: Overall OPIC Scores. 
Figure 15. Simulation 1: Depth Level 2 OPIC Scores.
Figure 16. Simulation 1: Depth Level 2 OPIC Score Variations.
Figure 17. Simulation 1: Depth Level 2 OPIC Score % Variations.


Additionally, depth level 5 also shows a significant change in OPIC scoring trend,
shown below in Figure 18; but again, this mirrors the original trend. Variant algorithms 1
and 2 follow previous trends as well, with variants 3 and 4 being in proportion to their
respective counterparts. Figures 19 and 20 provide the resulting variations with respect
to the average amount of cash within level 5 for a given variant, and as a percentage of
that average. No new information is gained from these graphs, as there are no previously
discovered links coming in to any of these documents.
Figure 18. Simulation 1: Depth Level 5 OPIC Scores.
Figure 19. Simulation 1: Depth Level 5 OPIC Score Variations.
Figure 20. Simulation 1: Depth Level 5 OPIC Score % Variations.


2. Medium Complexity Network 


The second type of random network examined is one of medium complexity.
Medium complexity is defined here as a network with more than 20, but fewer than 50,
documents in its web-link graph. Figure 21, shown below, is a visual representation of
the network's web-link structure. Figure 21 was constructed from Table 8, which
contains the data generated in Matlab to create the network.
Table 8. Simulation 2, Medium Complexity Web Link Graph Data.

Due to the increasing complexity of simulation 2's link structure, evaluating a
medium complexity simulation is a bit more difficult than the previous one. Figure 22,
shown below, provides an overview of simulation 2's OPIC scoring trend, with random
spikes representing documents of suggested higher importance. Depth level 2 document
comparisons from Figure 22 show that document 3 is more important than document 2
for all of the variant algorithms due to its web-link structure. This is to be expected since
document 2 contains a self referral link as well as an outgoing link pointing to document
3. Depth level 4 is also shown to have a significant increase in OPIC value for
documents 13 and 14. Again, this is due to the self referral link in document 7 and the
incoming link from document 12 to document 14.
Figure 22. Simulation 2: Overall OPIC Scores.

Depth level 5 provides the most interesting results for the given variant
algorithms, provided below in Figure 23. Initially, the OPIC value for document 19 is on
par with other documents within the level. Due to the removal of self referral links
and the additional value of previously discovered documents pointing to it from within
the network, document 19 significantly increases in value. This is illustrated in Figure 24
as a measure of change from the average cash value within the level. Figure 25 further
expresses this as an increase ranging from 120 to 200%. Document 22 also significantly
increases in value for the same reasons, with the increase ranging from 400 to 1000%
when compared to the average cash value contained within the depth level.
Figure 23. Simulation 2: Depth Level 5 OPIC Scores.
Figure 24. Simulation 2: Depth Level 5 OPIC Score Variations.
Figure 25. Simulation 2: Depth Level 5 OPIC Score % Variations.


3. High Complexity Network


The final type of random network examined is one of high complexity. High
complexity is defined here as a network with more than 50 documents in its web-link
graph. No figure is provided due to the extreme complexity and length of the network's
web-link structure. Appendix B contains the data generated in Matlab to create the given
network.


Evaluating a high complexity simulation is very difficult. Figure 26, shown
below, provides an overview of simulation 3's OPIC scoring trend, with random spikes
representing documents with a higher importance. Due to the high number of documents
contained in the network, this graph is only able to show that variations exist within the
network; each level needs further review.
Figure 26. Simulation 3: Overall OPIC Scores.

Depth level 3 document comparisons from Figure 27 show that documents 10 and
19 become significantly more important than other documents in the level for all of the
variant algorithms due to the network's web-link structure. Figure 28 shows this
variation as a visible increase in the OPIC score for document 10, ranging from 140 to
240%. Document 19, on the other hand, is able to maintain its OPIC score while the rest
of the documents around it decrease significantly with respect to the average value,
therefore maintaining its importance.
Figure 27. Simulation 3: Depth Level 3 OPIC Scores.
Figure 28. Simulation 3: Depth Level 3 OPIC Score % Variations.


Depth levels 4 and 5 provide the most interesting results for the given variant
algorithms, shown below in Figures 29 and 31. Multiple documents increase their
OPIC scores, by between 10 and 650% in Figures 30 and 32. These levels
demonstrate the effectiveness of this algorithm by significantly increasing the scores of
documents 41, 55, 59, 66, 73, 74, 77, 78, 79, 89, 90, 94, 95, 102, 110, 113, 115, 119, 133,
134, 144, 150, 151, 161, 170, 177, 182, 184, 189, and 205 above the average value
threshold, while effectively lowering the scores of documents 23, 27, 28, and 29 below the
average threshold value. These results match the complex link structure that is derived
from the data contained in Appendix C.


Overall, having conducted 3 random network simulations, the results indicate
moderate success of our newly proposed OPIC algorithm, considering that results are
based solely on the web-link graph structure. Comparing a document's OPIC value to the
average value contained within the depth level also allowed a measure of comparison
regarding effectiveness.
Figure 29. Simulation 3: Depth Level 4 OPIC Scores.
Figure 30. Simulation 3: Depth Level 4 OPIC Score % Variations.
Figure 31. Simulation 3: Depth Level 5 OPIC Scores.














[Figure 32 plot, titled "Depth Level 5 OPIC Score % Variation": % change in OPIC score
versus document number for the Original, Variant 1, Variant 2, Variant 3 and Variant 4
algorithms.]


Figure 32. Simulation 3: Depth Level 5 OPIC Score % Variations. 
VI. CONCLUSIONS


A. SUMMARY 


The research completed in this thesis showed that when implementing the new
OPIC algorithm variations, documents referred to more often within a given web graph
receive a higher percentage of the overall OPIC cash within that level and throughout the
overall web graph, when compared to the original algorithm. This in turn means that the
document with a higher OPIC value is more relevant based solely on its link structure.
Variants 3 and 4 show the most promise with regard to changing the OPIC score
effectively by removing self referral links. We believe that applying this to the Nutch
WebCrawler will make it an effective tool in helping to discover, track and monitor IED
education networks over the World Wide Web.

B. CONCLUSIONS 


Based on the experimental results given in Chapter V, the most important
documents within a web graph can be filtered out for a given level via an OPIC threshold
score. To do this, a reasonable threshold value for a given level must be set by the user.
In these experiments, the average value of a node within the depth level was used with
moderate success. Additionally, it was confirmed that the more documents are found
during a given search, the greater the chance that another document's OPIC score is
influenced, thereby increasing its overall score and the chance that the document will
cross the set depth level threshold value.
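
The threshold filtering described above can be sketched in Python. The function names
and the example cash values are hypothetical. When no thresholds are supplied, the
sketch falls back to the per-level average, which is the threshold used in these
experiments.

```python
from collections import defaultdict


def level_averages(cash, depth):
    """Average cash value per depth level."""
    totals, counts = defaultdict(float), defaultdict(int)
    for doc, value in cash.items():
        totals[depth[doc]] += value
        counts[depth[doc]] += 1
    return {level: totals[level] / counts[level] for level in totals}


def filter_by_threshold(cash, depth, thresholds=None):
    """Return documents whose cash exceeds the threshold set for their depth level."""
    if thresholds is None:
        thresholds = level_averages(cash, depth)  # default: per-level average
    return sorted(doc for doc, value in cash.items()
                  if value > thresholds[depth[doc]])


# Made-up cash values for four documents in depth level 5.
cash = {10: 0.125, 11: 0.125, 12: 0.5, 13: 0.25}
depth = {10: 5, 11: 5, 12: 5, 13: 5}
print(filter_by_threshold(cash, depth))            # threshold = level average
print(filter_by_threshold(cash, depth, {5: 0.1}))  # user-supplied threshold
```

A user-supplied threshold per level can replace the average where a stricter or looser
cut-off is wanted.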


Overall, this research delivered a random network generator with plug-ins capable
of simulating the Nutch OPIC algorithm, as well as a new OPIC variant algorithm. In the
end, it must be remembered that no matter how good an algorithm is at ranking, the
results will only be as good as the pages indexed by the search engine. A page cannot be
ranked if it has not been retrieved. All of these issues and more must be taken into
account when attempting to find IED education networks over the World Wide Web.




C. FUTURE WORK 


Domain comparison is a serious issue not addressed within the scope of this
project. Domains were not separated using this search technique, implying a higher
importance to the initial domain searched and less to those found during the search. This
will pose significant problems when attempting to search across multiple domains.
Additionally, once the cash value given to a node becomes small enough, Java floating
point errors have the potential to become a problem for large web-link graphs. It is
unknown at this time how large a web-link graph would need to be for this problem to
become a reality.
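
The floating point concern can be illustrated with a small Python sketch. It assumes
that cash is held in 32-bit precision, as a Java float would be, and that cash is divided
evenly by a fixed fan-out at every level; both are simplifying assumptions made for this
example, and the function names are hypothetical. The sketch counts how many divisions
it takes before adding the remaining cash to a baseline value no longer changes that
value.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def splits_until_lost(start_cash=1.0, fanout=10, baseline=1.0):
    """Count divisions by `fanout` until `baseline + cash` equals `baseline`
    when the sum is rounded to 32-bit precision."""
    cash, splits = to_float32(start_cash), 0
    while to_float32(baseline + cash) != to_float32(baseline):
        cash = to_float32(cash / fanout)
        splits += 1
    return splits


print(splits_until_lost())            # fan-out of 10
print(splits_until_lost(fanout=2))    # fan-out of 2
```

The point of the sketch is only that contributions below the precision of the stored
score are silently lost; the graph size at which this matters in Nutch itself remains
an open question, as stated above.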


Implementation of this new algorithm in searching for IED education networks
using Nutch could be accomplished through many different methods. One way might be
to use a cluster of different computers with many different addresses and merge their
results. Unfortunately for this approach, the domain comparison problem previously
mentioned will pose significant challenges. Another would be to use Nutch as a cover;
that is, knowing that an IED education network exists for a given domain, initiate a
crawl using the known IED education network root node document to determine the
depth of the network's existence. Currently, Nutch is optimized for this by being able to
effectively search a single domain knowing that the initial document has significant
importance.


Monitoring IED education networks found using this algorithm is the next step in
determining the true measure of the new algorithm's effectiveness. Unfortunately, Nutch
has inherent flaws in implementing OPIC in that the historical cash in the system builds
very early and decays slowly over time. This will cause scoring problems for later
searches that attempt to monitor changes in OPIC scores concerning sites of interest.
Later versions of Nutch have neutralized this problem by resetting the historical cash
to zero upon re-crawl. This, however, causes another problem in that documents of
significant importance are not given any weight for having been previously found to be
important. Overall, these problems and concerns will need considerable research to
achieve a more effective IED education network web crawler.

APPENDIX A. NUTCH XML CONFIGURATION FILE


The text below is the standard default Nutch XML configuration file:


<?xml version="1.0"?> 
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

















<!-- Do not modify this file directly. Instead, copy entries that you -->
<!-- wish to modify from this file into nutch-site.xml and change them -->
<!-- there. If nutch-site.xml does not already exist, create it. -->

<configuration> 

<!-- file properties -->

<property> 


<name>file.content.limit</name> 
<value>65536</value> 
<description>The length limit for downloaded content, in bytes. 
If this value is nonnegative (>=0), content longer than it will be 
truncated; otherwise, no truncation at all. 
</description> 
</property> 


<property> 
<name>file.content.ignored</name> 
<value>true</value> 
<description>If true, no file content will be saved during fetch. 
And it is probably what we want to set most of time, since file:// 
































URLs are meant to be local and we can always use them directly at 
Parsing and indexing stages. Otherwise file contents will be saved. 
!! NO IMPLEMENTED YET !! 

</description> 
</property> 
<!-- HTTP properties --> 
<property> 

<name>http.agent.name</name>

<value></value> 





<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your 
organization. 





NOTE: You should also check other related properties: 
http.robots.agents
http.agent.description 
http.agent.url 
http.agent.email 
http.agent.version 


and set their values appropriately. 


</description> 
</property> 


<property> 
<name>http.robots.agents</name> 
<value>*</value> 
<description>The agent strings we'll look for in robots.txt files, 
comma-separated, in decreasing order of precedence. You should 
put the value of http.agent.name as the first agent name, and keep 
the default * at the end of the list. E.g.: BlurflDev,Blurfl,* 
</description> 

</property> 








<property> 
<name>http.robots.403.allow</name> 
<value>true</value> 
<description>Some servers return HTTP status 403 (Forbidden) if 
/robots.txt doesn't exist. This should probably mean that we are 
allowed to crawl the site nonetheless. If this is set to false, 
then such sites will be treated as forbidden. 
</description> 
</property> 











<property> 
<name>http.agent.description</name> 
<value></value> 
<description>Further description of our bot- this text is used in 
the User-Agent header. It appears in parenthesis after the agent 
name. 
</description> 

</property> 





<property> 
<name>http.agent.url</name> 
<value></value> 
<description>A URL to advertise in the User-Agent header. This will 
appear in parenthesis after the agent name. Custom dictates that 
this should be a URL of a page explaining the purpose and behavior 
of this crawler. 
</description> 

</property> 





<property> 
<name>http.agent.email</name> 
<value></value> 
<description>An email address to advertise in the HTTP 'From' request 
header and User-Agent header. A good practice is to mangle this 
address (e.g. 'info at example dot com') to avoid spamming.
</description> 
</property> 


<property> 
<name>http.agent.version</name> 
<value>Nutch-0.8.1</value> 
<description>A version string to advertise in the User-Agent 
header. 
</description> 
</property> 


<property> 
<name>http.timeout</name> 
<value>10000</value> 
<description>The default network timeout, in 
milliseconds. 
</description> 

</property> 





<property> 
<name>http.max.delays</name> 
<value>100</value> 
<description>The number of times a thread will delay when trying to 
fetch a page. Each time it finds that a host is busy, it will wait 
fetcher.server.delay. After http.max.delays attepts, it will give 
up on the page for now. 
</description> 
</property> 














<property> 
<name>http.content.limit</name> 
<value>65536</value> 
<description>The length limit for downloaded content, in bytes. 
If this value is nonnegative (>=0), content longer than it will be 
truncated; otherwise, no truncation at all. 
</description> 
</property> 


<property> 
<name>http.proxy.host</name> 
<value></value> 
<description>The proxy hostname. If empty, no proxy is 
used. 
</description> 
</property> 


<property> 
<name>http.proxy.port</name> 
<value></value> 
<description>The proxy port. 
</description> 

</property> 


<property> 
<name>http.verbose</name>
<value>false</value> 
<description>If true, HTTP will log more verbosely. 
</description> 
</property> 


<property> 
<name>http.redirect.max</name>
<value>3</value> 
<description>The maximum number of redirects the fetcher will follow 
when trying to fetch a page. 
</description> 
</property> 


<property> 

<name>http.useHttp11</name>

<value>false</value> 
<description>NOTE: at the moment this works only for
protocol-httpclient. If true, use HTTP 1.1, if false use HTTP 1.0
</description> 


</property> 





<!-- FTP properties --> 


<property> 
<name>ftp.username</name> 
<value>anonymous</value> 
<description>ftp login username. 
</description> 

</property> 


<property> 
<name>ftp.password</name> 
<value>anonymous@example.com</value> 
<description>ftp login password. 
</description> 

</property> 


<property> 
<name>ftp.content.limit</name> 
<value>65536</value> 
<description>The length limit for downloaded content, in bytes. 
If this value is nonnegative (>=0), content longer than it will be 
truncated; otherwise, no truncation at all. Caution: classical ftp 
RFCs never defines partial transfer and, in fact, some ftp servers 























out there do not handle client side forced close-down very well. Our 
implementation tries its best to handle such situations smoothly. 
</description> 
</property> 
<property> 


<name>ftp.timeout</name> 

<value>60000</value> 

<description>Default timeout for ftp client socket, in millisec. 
Please also see ftp.keep.connection below. 
</description>
</property> 


<property> 
<name>ftp.server.timeout</name> 
<value>100000</value> 
<description>An estimation of ftp server idle time, in millisec. 
Typically it is 120000 millisec for many ftp servers out there. 
Better be conservative here. Together with ftp.timeout, it is used 
to decide if we need to delete (annihilate) current ftp.client 
instance and force to start another ftp.client instance anew. This 
is necessary because a fetcher thread may not be able to obtain next 
request from queue in time (due to idleness) before our ftp client 
times out or remote server disconnects. Used only when 
ftp.keep.connection is true (please see below).
</description> 
</property> 
































<property> 
<name>ftp.keep.connection</name> 
<value>false</value> 
<description>Whether to keep ftp connection. Useful if crawling same 
host again and again. When set to true, it avoids connection, login 
and dir list parser setup for subsequent urls. If it is set to true, 
however, you must make sure (roughly): 
(1) ftp.timeout is less than ftp.server.timeout 
(2) ftp.timeout is larger than (fetcher.threads.fetch * 
fetcher.server.delay) 
Otherwise there will be too many "delete client because idled too 
long" messages in thread logs. 
</description> 
</property> 

















<property> 
<name>ftp.follow.talk</name> 
<value>false</value> 
<description>Whether to log dialogue between our client and remote 
server. Useful for debugging. 
</description> 
</property> 


<!-- web db properties --> 


<property> 
<name>db.default.fetch.interval</name> 
<value>30</value> 
<description>The default number of days between re-fetches of a page. 
</description> 
</property> 








<property> 
<name>db.ignore.internal.links</name> 
<value>true</value> 
<description>If true, when adding new links to a page, links from 
the same host are ignored. This is an effective way to limit the 
size of the link database, keeping only the highest quality
links. 
</description> 

</property> 


<property> 
<name>db.ignore.external.links</name> 
<value>false</value> 
<description>If true, outlinks leading from a page to external hosts 
will be ignored. This is an effective way to limit the crawl to 
include only initially injected hosts, without creating complex 
URLFilters. 
</description> 

</property> 




















<property> 
<name>db.score.injected</name> 
<value>1.0</value> 
<description>The score of new pages added by the injector. 
</description> 
</property> 





<property> 
<name>db.score.link.external</name> 
<value>1.0</value> 
<description>The score factor for new pages added due to a link from 
another host relative to the referencing page's score. Scoring 
plugins may use this value to affect initial scores of external 
links. 
</description> 

</property> 





<property> 
<name>db.score.link.internal</name> 
<value>1.0</value> 
<description>The score factor for pages added due to a link from the 
same host, relative to the referencing page's score. Scoring plugins 
may use this value to affect initial scores of internal links. 
</description> 
</property> 








<property> 
<name>db.score.count.filtered</name> 
<value>false</value> 
<description>The score value passed to newly discovered pages is 
calculated as a fraction of the original page score divided by the 
number of outlinks. If this option is false, only the outlinks that 
passed URLFilters will count, if it's true then all outlinks will 
count. 
</description> 

</property> 

















<property> 
<name>db.max.inlinks</name> 
<value>10000</value> 
<description>Maximum number of Inlinks per URL to be kept in LinkDb.
If "invertlinks" finds more inlinks than this number, only the first 
N inlinks will be stored, and the rest will be discarded. 
</description> 
</property> 


<property> 
<name>db.max.outlinks.per.page</name> 
<value>100</value> 
<description>The maximum number of outlinks that we'll process for a 
page. If this value is nonnegative (>=0), at most 
db.max.outlinks.per.page outlinks will be processed for a page; 
otherwise, all outlinks will be processed. 
</description> 

</property> 














<property> 
<name>db.max.anchor.length</name> 
<value>100</value> 
<description>The maximum number of characters permitted in an anchor. 
</description> 
</property> 


<property> 
<name>db.fetch.retry.max</name> 
<value>3</value> 
<description>The maximum number of times a url that has encountered 
recoverable errors is generated for fetch. 
</description> 
</property> 





<property> 
<name>db.signature.class</name> 
<value>org.apache.nutch.crawl.MD5Signature</value> 
<description>The default implementation of a page signature. 
Signatures created with this implementation will be used for 
duplicate detection and removal. 
</description> 

</property> 








<property> 
<name>db.signature.text_profile.min_token_len</name> 
<value>2</value> 
<description>Minimum token length to be included in the signature. 
</description> 

</property> 


<property> 

<name>db.signature.text_profile.quant_rate</name> 

<value>0.01</value> 

<description>Profile frequencies will be rounded down to a multiple 
of QUANT = (int) (QUANT_RATE * maxFreq), where maxFreq is a maximum 
token frequency. If maxFreq > 1 then QUANT will be at least 2, which 
means that for longer texts tokens with frequency 1 will always be 


discarded. 
</description>
</property> 





<!-- generate properties -->


<property> 
<name>generate.max.per.host</name> 
<value>-1</value> 
<description>The maximum number of urls per host in a single 
fetchlist. -1 if unlimited. 
</description> 
</property> 





<property> 
<name>generate.max.per.host.by.ip</name> 
<value>false</value> 
<description>If false, same host names are counted. If true, 
hosts' IP addresses are resolved and the same IP-s are counted. 








-+-+-+- WARNING !!! -+-+-+-
When set to true, Generator will create a lot of DNS lookup 
requests, rapidly. This may cause a DOS attack on 
remote DNS servers, not to mention increased external traffic 
and latency. For these reasons when using this option it is 
required that a local caching DNS be used. 
</description> 

</property> 





<!-- fetcher properties --> 


<property> 
<name>fetcher.server.delay</name> 
<value>5.0</value> 
<description>The number of seconds the fetcher will delay between 
successive requests to the same server. 
</description> 
</property> 





<property> 

<name>fetcher.max.crawl.delay</name> 

<value>30</value> 

<description> 
If the Crawl-Delay in robots.txt is set to greater than this value 
(in seconds) then the fetcher will skip this page, generating an 
error report. If set to -1 the fetcher will never skip such pages and 
will wait the amount of time retrieved from robots.txt Crawl-Delay, 
however long that might be. 

</description> 

</property> 











<property> 
<name>fetcher.threads.fetch</name> 
<value>10</value> 
<description>The number of FetcherThreads the fetcher should use. 
This is also determines the maximum number of requests that are 
made at once (each FetcherThread handles one connection).
</description> 
</property> 


<property> 
<name>fetcher.threads.per.host</name> 
<value>1</value> 
<description>This number is the maximum number of threads that 
should be allowed to access a host at one time. 
</description> 
</property> 








<property> 
<name>fetcher.threads.per.host.by.ip</name> 
<value>true</value> 
<description>If true, then fetcher will count threads by IP address, 
to which the URL's host name resolves. If false, only host name will 
be used. NOTE: this should be set to the same value as 

















"generate.max.per.host.by.ip" - default settings are different only 
for reasons of backward-compatibility. 
</description> 
</property> 
<property> 


<name>fetcher.verbose</name> 
<value>false</value> 
<description>If true, fetcher will log more verbosely. 
</description> 
</property> 





<property> 
<name>fetcher.parse</name> 
<value>true</value> 
<description>If true, fetcher will parse content. 
</description> 
</property> 


<property> 
<name>fetcher.store.content</name> 
<value>true</value> 
<description>If true, fetcher will store content. 
</description> 

</property> 


<!-- indexer properties --> 


<property> 
<name>indexer.score.power</name> 
<value>0.5</value> 
<description>Determines the power of link analyis scores. Each 
pages's boost is set to <i>score<sup>scorePower</sup></i> where 
<i>score</i> is its link analysis score and <i>scorePower</i> is the 





value of this parameter. This is compiled into indexes, so, when 
this is changed, pages must be re-indexed for it to take 
effect. 
</description>
</property> 


<property> 
<name>indexer.max.title.length</name> 
<value>100</value> 
<description>The maximum number of characters of a title that are 
indexed. 
</description> 
</property> 


<property> 

<name>indexer.max.tokens</name> 

<value>10000</value> 

<description> 
The maximum number of tokens that will be indexed for a single field 
in a document. This limits the amount of memory required for 
indexing, so that collections with very large files will not crash 
the indexing process by running out of memory. 











Note that this effectively truncates large documents, excluding 
from the index tokens that occur further in the document. If you 
know your source documents are large, be sure to set this value 
high enough to accomodate the expected size. If you set it to
Integer.MAX_VALUE, then the only limit is your memory, but you 
should anticipate an OutOfMemoryError. 
</description> 
</property> 

















<property> 
<name>indexer.mergeFactor</name> 
<value>50</value> 
<description>The factor that determines the frequency of Lucene 
segment merges. This must not be less than 2, higher values increase 
indexing speed but lead to increased RAM usage, and increase the 
number of open file handles (which may lead to "Too many open files" 

errors). NOTE: the "segments" here have nothing to do with Nutch 
segments, they are a low-level data unit used by Lucene. 
</description> 

</property> 

















<property> 
<name>indexer.minMergeDocs</name> 
<value>50</value> 
<description>This number determines the minimum number of Lucene 
Documents buffered in memory between Lucene segment merges. Larger 
values increase indexing speed and increase RAM usage. 
</description> 

</property> 














<property> 
<name>indexer.maxMergeDocs</name> 
<value>2147483647</value> 
<description>This number determines the maximum number of Lucene 
Documents to be merged into a new Lucene segment. Larger values 
increase batch indexing speed and reduce the number of Lucene
segments, which reduces the number of open file handles; however, 
this also decreases incremental indexing performance. 
</description> 
</property> 





<property> 
<name>indexer.termIndexInterval</name> 
<value>128</value> 
<description>Determines the fraction of terms which Lucene keeps in 
RAM when searching, to facilitate random-access. Smaller values use 
more memory but make searches somewhat faster. Larger values use 
less memory but make searches somewhat slower. 
</description> 
</property> 














<!-- analysis properties --> 


<property> 
<name>analysis.common.terms.file</name> 
<value>common-terms.utf8</value>
<description>The name of a file containing a list of common terms 
that should be indexed in n-grams. 
</description> 

</property> 





<!-- searcher properties --> 


<property> 
<name>searcher.dir</name> 
<value>crawl</value> 
<description> 
Path to root of crawl. This directory is searched (in 
order) for either the file search-servers.txt, containing a list of 
distributed search servers, or the directory "index" containing 
merged indexes, or the directory "segments" containing segment 
indexes. 
</description> 

</property> 








<property> 
<name>searcher.filter.cache.size</name> 
<value>16</value> 
<description> 
Maximum number of filters to cache. Filters can accelerate certain 
field-based queries, like language, document format, etc. Each 
filter requires one bit of RAM per page. So, with a 10 million page 
index, a cache size of 16 consumes two bytes per page, or 20MB. 
</description> 
</property> 








<property> 
<name>searcher.filter.cache.threshold</name> 
<value>0.05</value> 
<description>
Filters are cached when their term is matched by more than this
fraction of pages. For example, with a threshold of 0.05, and 10
million pages, the term must match more than 1/20, or 500,000 pages.
So, if out of 10 million pages, 50% of pages are in English, and 2%
are in Finnish, then, with a threshold of 0.05, searches for
"lang:en" will use a cached filter, while searches for "lang:fi"
will score all 200,000 Finnish documents.
</description>
</property> 
<property> 


<name>searcher.hostgrouping.rawhits.factor</name> 
<value>2.0</value> 
<description> 
A factor that is used to determine the number of raw hits 
initially fetched, before host grouping is done. 
</description> 
</property> 








<property> 
<name>searcher.summary.context</name> 
<value>5</value> 
<description> 
The number of context terms to display preceding and following 
matching terms in a hit summary. 
</description> 
</property> 


<property> 
<name>searcher.summary.length</name>
<value>20</value> 
<description> 
The total number of terms to display in a hit summary. 
</description> 
</property> 





<property> 
<name>searcher.max.hits</name> 
<value>-1</value> 
<description>If positive, search stops after this many hits are 
found. Setting this to small, positive values (e.g., 1000) can make 
searches much faster. With a sorted index, the quality of the hits 
suffers little. 
</description> 
</property> 








<property> 

<name>searcher.max.time.tick_count</name> 

<value>-1</value> 

<description>If positive value is defined here, limit search time for 
every request to this number of elapsed ticks (see the tick_length 
property below). The total maximum time for any search request will 
be then limited to tick_count * tick_length milliseconds. When 
search time is exceeded, partial results will be returned, and the 
total number of hits will be estimated.
</description> 
</property> 


<property> 
<name>searcher.max.time.tick_length</name> 
<value>200</value> 
<description>The number of milliseconds between ticks. Larger values
reduce the timer granularity (precision). Smaller values bring more
overhead.
</description>

</property> 

<!-- URL normalizer properties --> 

<property> 


<name>urlnormalizer.class</name> 
<value>org.apache.nutch.net.BasicUrlNormalizer</value> 
<description>Name of the class used to normalize URLs. 
</description> 

</property> 





<property> 
<name>urlnormalizer.regex.file</name> 
<value>regex-normalize.xml</value> 
<description>Name of the config file used by the RegexUrlNormalizer 
class. 
</description> 

</property> 











<!-- mime properties -->


<property> 
<name>mime.types.file</name> 
<value>mime-types.xml</value> 
<description>Name of file in CLASSPATH containing filename extension 
and magic sequence to mime types mapping information 
</description> 
</property> 





<property> 
<name>mime.type.magic</name> 
<value>true</value> 
<description>Defines if the mime content type detector uses magic 
resolution. 
</description> 
</property> 


<!-- plugin properties --> 


<property> 
<name>plugin.folders</name>
<value>plugins</value> 
<description>Directories where nutch plugins are located. Each 
element may be a relative or absolute path. If absolute, it is used 
as is. If relative, it is searched for on the
classpath.</description> 
</property> 


<property> 
<name>plugin.auto-activation</name> 
<value>true</value> 
<description>Defines whether plugins that are not activated by
the plugin.includes and plugin.excludes properties must be
automatically activated if they are needed by some activated plugins.
</description>

</property> 





<property> 
<name>plugin.includes</name> 
<value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
<description>Regular expression naming plugin directory names to 
include. Any plugin not matching this expression is excluded. 
In any case you need to include at least the nutch-extensionpoints
plugin. By default Nutch includes crawling just HTML and plain text 
via HTTP, and basic indexing and search plugins. 
</description> 
</property> 











<property> 
<name>plugin.excludes</name> 
<value></value> 
<description>Regular expression naming plugin directory names to 
exclude. 
</description> 
</property> 





<!-- parser properties -->


<property> 
<name>parse.plugin.file</name>
<value>parse-plugins.xml</value> 
<description>The name of the file that defines the associations 
between content-types and parsers. 
</description> 
</property> 








<property> 
<name>parser.character.encoding.default</name> 
<value>windows-1252</value> 
<description>The character encoding to fall back to when no other 
information is available 
</description> 
</property> 








<property> 
<name>parser.html.impl</name> 
<value>neko</value> 
<description>HTML Parser implementation. Currently the following 
keywords are recognized: "neko" uses NekoHTML, "tagsoup" uses
TagSoup. 
</description> 

</property> 


<property> 
<name>parser.html.form.use_action</name> 
<value>false</value> 
<description>If true, HTML parser will collect URLs from form action 
attributes. This may lead to undesirable behavior (submitting empty 
forms during next fetch cycle). If false, form action attribute will 
be ignored. 
</description> 
</property> 





<!-- urlfilter plugin properties --> 





<property> 
<name>urlfilter.regex.file</name> 
<value>regex-urlfilter.txt</value> 
<description>Name of file on CLASSPATH containing regular expressions 
used by urlfilter-regex (RegexURLFilter) plugin. 
</description> 

</property> 




















<property> 
<name>urlfilter.automaton.file</name> 
<value>automaton-urlfilter.txt</value> 
<description>Name of file on CLASSPATH containing regular expressions 
used by urlfilter-automaton (AutomatonURLFilter) plugin. 
</description> 

</property> 





<property> 
<name>urlfilter.prefix.file</name> 
<value>prefix-urlfilter.txt</value> 
<description>Name of file on CLASSPATH containing url prefixes 
used by urlfilter-prefix (PrefixURLFilter) plugin.</description> 
</property> 








<property> 
<name>urlfilter.suffix.file</name> 
<value>suffix-urlfilter.txt</value> 
<description>Name of file on CLASSPATH containing url suffixes 
used by urlfilter-suffix (SuffixURLFilter) plugin.</description> 
</property> 








<property> 
<name>urlfilter.order</name> 
<value></value> 
<description>The order by which url filters are applied. 
If empty, all available url filters (as dictated by properties 
plugin-includes and plugin-excludes above) are loaded and applied in 
system defined order. If not empty, only named filters are loaded 
and applied in given order. For example, if this property has value:
org.apache.nutch.net.RegexURLFilter
org.apache.nutch.net.PrefixURLFilter
then RegexURLFilter is applied first, and PrefixURLFilter second.

Since all filters are AND'ed, filter ordering does not have impact
on end result, but it may have performance implication, depending
on relative expensiveness of filters.

</description> 
</property> 














<!-- scoring filters properties --> 


<property> 
<name>scoring.filter.order</name> 
<value></value> 
<description>The order in which scoring filters are applied. 
This may be left empty (in which case all available scoring 
filters will be applied in the order defined in plugin-includes 
and plugin-excludes), or a space separated list of implementation 
classes. 
</description> 
</property> 























<!-- clustering extension properties -->


<property> 
<name>extension.clustering.hits-to-cluster</name> 
<value>100</value> 
<description>Number of snippets retrieved for the clustering 
extension if clustering extension is available and user requested 
results to be clustered. 
</description> 

</property> 











<property> 
<name>extension.clustering.extension-name</name> 
<value></value> 
<description>Use the specified online clustering extension. If empty, 
the first available extension will be used. The "name" here refers 
to an 'id' attribute of the 'implementation' element in the plugin 
descriptor XML file. 
</description> 

</property> 




















<!-- ontology extension properties -->


<property> 
<name>extension.ontology.extension-name</name> 
<value></value> 
<description>Use the specified online ontology extension. If empty, 
the first available extension will be used. The "name" here refers 
to an 'id' attribute of the 'implementation' element in the plugin 
descriptor XML file. 
</description> 

</property> 
<property>
<name>extension.ontology.urls</name>
<value>
</value>
<description>Urls of owl files, separated by spaces, such as
http://www.example.com/ontology/time.owl
http://www.example.com/ontology/space.owl
http://www.example.com/ontology/wine.owl
Or
file:/ontology/time.owl
file:/ontology/space.owl
file:/ontology/wine.owl
You have to make sure each url is valid.
By default, there is no owl file, so query refinement based on
ontology is silently ignored.
</description>
</property>


<!-- query-basic plugin properties -->


<property>
<name>query.url.boost</name>
<value>4.0</value>
<description> Used as a boost for url field in Lucene query.
</description>
</property>


<property>
<name>query.anchor.boost</name>
<value>2.0</value>
<description> Used as a boost for anchor field in Lucene query.
</description>
</property>


<property>
<name>query.title.boost</name>
<value>1.5</value>
<description> Used as a boost for title field in Lucene query.
</description>
</property>


<property>
<name>query.host.boost</name>
<value>2.0</value>
<description> Used as a boost for host field in Lucene query.
</description>
</property>


<property>
<name>query.phrase.boost</name>
<value>1.0</value>
<description> Used as a boost for phrase in Lucene query.
Multiplied by boost for field phrase is matched in.
</description>
</property>


<!-- creative-commons plugin properties --> 


<property> 
<name>query.cc.boost</name> 
<value>0.0</value> 
<description> Used as a boost for cc field in Lucene query. 
</description> 
</property> 





<!-- query-more plugin properties -->


<property> 
<name>query.type.boost</name> 
<value>0.0</value> 
<description> Used as a boost for type field in Lucene query. 
</description> 
</property> 





<!-- query-site plugin properties -->


<property> 
<name>query.site.boost</name> 
<value>0.0</value> 
<description> Used as a boost for site field in Lucene query. 
</description> 
</property> 





<!-- microformats-reltag plugin properties --> 


<property> 
<name>query.tag.boost</name> 
<value>1.0</value> 
<description> Used as a boost for tag field in Lucene query. 
</description> 
</property> 


<!-- language-identifier plugin properties --> 





<property> 
<name>lang.ngram.min.length</name> 
<value>1</value> 
<description> The minimum size of ngrams to use to identify
language (must be between 1 and lang.ngram.max.length).
The larger the range between lang.ngram.min.length and
lang.ngram.max.length, the better the identification, but
the slower it is.
</description>
</property> 














<property> 

<name>lang.ngram.max.length</name> 

<value>4</value> 

<description> The maximum size of ngrams to use to identify
language (must be between lang.ngram.min.length and 4).
The larger the range between lang.ngram.min.length and
lang.ngram.max.length, the better the identification, but
the slower it is.
</description>
</property> 


<property>
<name>lang.analyze.max.length</name>
<value>2048</value>
<description> The maximum bytes of data to use to identify
the language (0 means full content analysis).
The larger this value, the better the analysis, but the
slower it is.
</description>
</property>


<property> 
<name>query.lang.boost</name> 
<value>0.0</value> 
<description> Used as a boost for lang field in Lucene query. 
</description> 
</property> 


</configuration> 
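The arithmetic behind the searcher.filter.cache.size and searcher.filter.cache.threshold descriptions in the listing above can be checked with a short sketch. The function names below are hypothetical helpers written for illustration and are not part of Nutch; the only inputs taken from the listing are the one-bit-per-page cost of a filter, the cache size of 16, the threshold of 0.05, and the 10 million page example.

```python
def filter_cache_bytes(num_pages: int, cache_size: int) -> int:
    """Memory used by a full filter cache, at one bit per page per filter."""
    return num_pages * cache_size // 8


def filter_is_cached(matching_pages: int, num_pages: int, threshold: float) -> bool:
    """A filter is cached only if its term matches more than threshold * num_pages."""
    return matching_pages > threshold * num_pages


pages = 10_000_000
english_pages = pages * 50 // 100   # 50% of the index
finnish_pages = pages * 2 // 100    # 2% of the index

print(filter_cache_bytes(pages, 16))                 # bytes for 16 cached filters
print(filter_is_cached(english_pages, pages, 0.05))  # "lang:en"
print(filter_is_cached(finnish_pages, pages, 0.05))  # "lang:fi"
```

With these inputs the cache costs 20,000,000 bytes (about 20MB), the English filter clears the 500,000-page threshold, and the Finnish filter, matching 200,000 pages, does not.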







APPENDIX B. LUCENE SCORING EXAMPLE 


The example provided below calculates an Overall_Score(q,d) from Equation
3.2 given the following information:

A hypothetical query for the phrase "big bang" is conducted and document D1
was selected for analysis. For the word "big", D1 has a term frequency tf(t_in_d)
equal to 3, an inverse document frequency idf(t) equal to 2, a boost value
boost(t.field_in_d) equal to 1 (i.e. no boost), and a length normalization value
lengthNorm(t.field_in_d) equal to 5. For the word "bang", D1 has a term frequency
tf(t_in_d) equal to 2, an inverse document frequency idf(t) equal to 1.5, a boost value
boost(t.field_in_d) equal to 1 (i.e. no boost), and a length normalization value
lengthNorm(t.field_in_d) equal to 5. Applying Equation 3.1, the score value
score(q,d) for the query "big bang" in document D1 is equal to 82.5.

Taking this one step further, an overall score value Overall_Score(q,d) is
calculated using an overall boost value Overall_Boost(d) equal to 0.12, a coordination
factor Coord(q,d) equal to 0.25 and a query normalization value queryNorm(q) equal to
0.15. Document D1 is then calculated to have an overall score of 0.37125.
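
The two calculations can be reproduced with a minimal sketch. Equations 3.1 and 3.2 are not restated in this appendix, so the forms used here are assumptions: score(q,d) is taken as the sum over query terms of tf * idf^2 * boost * lengthNorm, and Overall_Score(q,d) as score(q,d) * Overall_Boost(d) * Coord(q,d) * queryNorm(q). These assumed forms do reproduce the values 82.5 and 0.37125 stated above. The class and function names are illustrative only and are not Lucene APIs.

```python
from dataclasses import dataclass


@dataclass
class TermStats:
    """Per-term inputs for one document (illustrative container, not a Lucene type)."""
    tf: float           # term frequency tf(t_in_d)
    idf: float          # inverse document frequency idf(t)
    boost: float        # boost(t.field_in_d)
    length_norm: float  # lengthNorm(t.field_in_d)


def score(terms):
    """Assumed form of Equation 3.1: sum of tf * idf^2 * boost * lengthNorm."""
    return sum(t.tf * t.idf ** 2 * t.boost * t.length_norm for t in terms)


def overall_score(raw_score, overall_boost, coord, query_norm):
    """Assumed form of Equation 3.2: raw score scaled by the three factors."""
    return raw_score * overall_boost * coord * query_norm


big = TermStats(tf=3, idf=2, boost=1, length_norm=5)     # contributes 3 * 4 * 1 * 5
bang = TermStats(tf=2, idf=1.5, boost=1, length_norm=5)  # contributes 2 * 2.25 * 1 * 5

raw = score([big, bang])
overall = overall_score(raw, overall_boost=0.12, coord=0.25, query_norm=0.15)
print(raw, round(overall, 5))
```

Under these assumptions "big" contributes 60 and "bang" contributes 22.5 to the raw score, and the combined scaling factor is 0.12 * 0.25 * 0.15 = 0.0045.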
APPENDIX C. SIMULATION 3 WEB LINK GRAPH


The following data is the high complexity random network generated in simulation 3 for Chapter V. 
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